Coding In Medicine

Agent-Sourced: A Provenance Tag for the Agent Era

Gyasi Sutton, MD, MPH — Mon, 08 Jun 2026 12:55:00 GMT

This is the short version of an idea I'm putting on record. I make the case more fully on the TwiceData blog, and in full — with the tier criteria, the transition and licensing rules, and a call to codify it in the open — in a white paper. I'm coining a term here: agent-sourced.

The fight

Open source has a trust problem, and it arrived faster than anyone was ready for. Through 2025 and into 2026, maintainers started drowning. curl ended its bug-bounty program in early 2026 after AI “slop” overwhelmed its security queue — not even one in twenty submissions was a genuine vulnerability. Zig adopted an outright no-LLM policy; Gentoo forbids contributions made with AI tools; NetBSD presumes LLM-generated code “tainted”; QEMU banned AI contributions on licensing grounds (a policy it’s now reconsidering); and GIMP and Flathub banned them outright. LLVM took the opposite tack — a human-in-the-loop policy that permits AI-assisted code so long as a person vouches for it.

It crystallized into a public fight. On one side, DHH: banning AI betrays open source’s founding mission — everyone’s right to change software. On the other, ThePrimeagen: the bans are triage, and quality has to stay with a person. Both are right, which is exactly why the argument doesn’t resolve. The frame forces a blunt binary: ban everything, or drown.

The wrong question

“Should agents be allowed to contribute” is the wrong question. The right one is how their contributions are labeled.

I spend a lot of my time in healthcare data, where provenance is not an abstraction. In clinical and biomedical software, “where did this come from, and who checked it” is the difference between a number you can act on and one you can’t. We already accept that a result is only as trustworthy as its lineage. The code agents are now writing deserves the same treatment — not a verdict on whether it’s allowed, but an honest label on where it came from.

Agent-sourced

Agent-sourced (n., adj.) — a change, project, or artifact created mostly autonomously by an agent (one or more, not necessarily a population), with human input extremely low. It is a provenance label, the agent-era analog to crowdsourced: it tells the audience what they’re looking at, so the work can be trusted, reviewed, and used accordingly.

That’s the whole move. When the producers of code change, the first thing you owe everyone downstream is honest provenance.

A tag does two things at once. It flags the contribution for deeper review, so a maintainer knows exactly where to look harder. And it still lets agent-driven work through — welcomed, not forbidden. Honest provenance instead of gatekeeping or blind faith.

It’s also the low-commitment option, which is why it could actually be adopted. Today a maintainer’s only moves are expensive: fully review and include a contribution, or ban the whole category. An agent-sourced tag is the lighter middle — you neither vet everything nor forbid everyone; you label, and let the label route attention. And crucially, it ships as an add-on, not another bot: a tag on a pull request or commit, not one more AI that fixes issues or reviews PRs. Those add load and noise — the very thing maintainers are banning. The tag changes almost nothing in the workflow; it just makes origin legible.

Two tiers, and a gate that can’t be skipped

The useful distinction isn’t a long ladder of labels — it’s two tiers, and the gate between them.

Tier 1 — raw agent-sourced. The agent’s output as submitted, not yet vetted by a person. No one vouches for it. A maintainer can fully ignore it, guilt-free, or browse it when they have time.

Tier 2 — human-verified agent-sourced. A person has reviewed it and vouches for it — but it is still agent-sourced. Verification doesn’t erase provenance; both facts travel together (made by an agent, checked by a human). Tier 2 is a required, non-bypassable designator: nothing reaches a release until it has earned it.

It honors both sides

Agent-sourced code is a double-edged sword, and the tag denies neither edge. It can hide bug-bombs — defects that, merged unverified, take weeks to dig out (exactly ThePrimeagen’s fear). It can also deliver deep, fast innovation (exactly DHH’s hope). The tag validates both: flag it, verify before you trust, and welcome it, don’t ban it. In practice the buffer is the fork — agent work lives on a tagged fork, quarantined from the base, and when something is genuinely good a human cherry-picks it into the base. The base stays clean, innovation stays free, and a person still holds the gate.

Not another bot — an identifier and a verifier

Concretely, agent-sourced is two things you can build. First, an identifier — the provenance tag itself, carried on a commit or pull request: a signed trailer, a label, a field a tool can read. Second, a verifier / PR framework — the machinery that promotes Tier 1 to Tier 2: who attests, what gets checked, and where the gate sits in the pull-request flow. That’s what makes it real rather than rhetorical.

The open question — and an invitation

The honest hard part is governance: who decides what is agent-sourced versus agent-assisted versus human, and who verifies it? The line is genuinely contested — and it shouldn’t be decreed by any one of us. A real standard would need attestation, a clear boundary between assisted and sourced, transition rules for when a human edits agent work, and licensing terms that travel with the tag — and it must be set by a body of industry leaders, the way the OSI defined “open source,” C2PA defined content provenance, and SPDX standardized license identifiers.

So consider this an invitation. If you maintain a project, steward a foundation, build the agents doing the contributing, or work the licensing side — let’s convene and define it. The open-source AI fight doesn’t have to end with a ban or a flood. It can end with graduated, verifiable trust: agent work welcomed, labeled, and reviewed in proportion to where it came from.

A fuller treatment is on the TwiceData blog, and the complete technical argument — tiers, transition, licensing, and the call to codify — is the Agent-Sourced white paper (also available as a PDF).

— Gyasi Sutton, MD, MPH

My AI Engineering Philosophy: Why I Never Get Locked In

Gyasi Sutton, MD, MPH — Mon, 11 Aug 2025 17:49:14 GMT

How I Learned This Lesson the Hard Way

When OpenAI first launched ChatGPT in November 2022, I was amazed. Here was a GPT-3.5-powered chatbot that could actually hold conversations. When GPT-4 launched in March 2023, I happily signed up for the $20/month Plus plan. It was a bargain — cutting-edge AI at my fingertips.

But this wasn't my first encounter with AI. I'd been working with mostly PyTorch and TensorFlow for some time, training my own models for medical metrics and research. That was the deep learning era — open-source frameworks, transparent architectures, and a truly open way of doing things. You trained your own models, you owned your own code, you controlled your own destiny.

Then ChatGPT happened, and everything shifted. The center moved from open research to vendor-trained models. The term "AI" exploded in popularity, but it meant something different now. Instead of building and training models, we were calling APIs. Instead of understanding architectures, we were optimizing prompts. Instead of owning our models, we were renting them.

Like everyone else, I followed the "standard" instructions: use OpenAI's API, optimize prompts for their model, ship fast.

But then I made what I now consider a rookie mistake: I invested deeply in LangChain, building flows tied to its abstractions. Then model version updates broke things. Same story with Guidance AI, DSPy, and Autogen — all great tools that I still keep in my toolkit, but each with its own quirks, dependencies, and upgrade pains.

It was a wake-up call: the deeper you go into one stack without guardrails, the harder it is to adapt when something changes — and in AI, everything changes fast.

The Trap Most Developers Fall Into

Every developer I know has been there. You start with a tool that feels perfect — maybe it's OpenAI, AWS Bedrock, or Google Vertex AI. It works beautifully, you build fast, and you think you've found the platform.

Then one day you realize:

Your prompts only work with one model.
Your entire pipeline is bound to one API.
Your deployment lives and dies on one vendor's infrastructure.

That's vendor lock-in, and it's the silent killer of AI agility.

My Philosophy: Vendor-Neutral, Model-Agnostic Development

I live by one rule:

Never build anything that can't run on any model, any platform, any time.

Why? Because I've been burned — and I know how expensive "re-platforming" can be.

1. Models Change Faster Than You Think

Claude, LLaMA, Mistral, and new models launch every month. There are dedicated YouTube channels and major online publications dedicated to the latest releases. If you're locked to one, you're already obsolete.

2. The Wood and Paper Analogy

Think of it this way: if the lowest level of AI capability is like a bunch of 5-year-olds gathering wood pieces, does it take a PhD to do this task? No. But if you need them to use a chipper and process to convert those shavings into paper, that's when you need an advanced agent and its processing power.

Simple tasks don't require advanced AI. But complex processes — like converting raw data into structured insights, or orchestrating multi-step workflows — absolutely need the processing power of models like Kimi K2 or DeepSeek-R1.

The problem is that you're paying premium prices for access to these advanced capabilities, but you don't own them. You're renting processing power that could be taken away or changed at any moment.

3. Vendors Change Their Terms

Prices go up. APIs get deprecated. Features disappear. If you're locked in, you're powerless.

How This Philosophy Plays Out in My Work

Abstraction Layers Everywhere

I never call a model API directly. I route through an abstraction that can swap models on the fly.

Litellm goes a long way toward truly unified, model-agnostic calls. I've also forked token.js to do the same thing on the js side. Rust has some interesting possibilities in this space for performance-heavy pipelines — I'm keeping an eye on it and experimenting where it makes sense.

Prompt Engineering That Travels

No model-specific hacks
Structured, portable formats
Fallback logic for weaker models

Universal Data Formats

JSON schemas anyone can consume
Embeddings from multiple providers
Vector formats compatible with any DB

Why Medium-Sized Businesses Must Train In-House

If you're running a medium-sized business, building in-house AI capability isn't a luxury — it's survival insurance.

You own the models, not just the API keys.
You keep your IP private.
You avoid scaling costs that explode as usage grows.

Small businesses might get by with vendor tools — quick to deploy, easy to use — but they're locked into someone else's feature set. That's fine for early stage speed, but it's a trap if you grow.

The Economics That Changed My Mind

At one point, my API and SaaS bills (across OpenAI, Claude, and others) were over $400/month. I run a lot of experiments, and the cost added up fast.

Now?

I dedicate half my local storage to open-source models — some fine-tuned on synthetic data. My total monthly AI infrastructure cost (all APIs, all SaaS, all cloud) is under $150 — with more control, more flexibility, and no lock-in.

That's the power of open source. That's the power of local models. That's why Chinese open-source LLMs are exploding in popularity — no gatekeepers, no monthly ransom, if you have the hardware to run it. There is a race between computer manufacturers and API services that now rent high-capacity GPUs to host open source models.

The Real Cost of Vendor Lock-In

I've watched teams waste months migrating to new APIs. I've seen products collapse because a vendor killed a feature. I've seen developers sidelined because they couldn't adapt.

The cost isn't just technical. It's existential.

My Development Principles

Always build abstraction layers — APIs are replaceable
Test with multiple models — at least three
Standardize your formats — avoid vendor-only data shapes
Plan your escape routes — migration is inevitable
Document dependencies — know exactly where you're locked

The Bottom Line

Vendor lock-in is death by a thousand paper cuts. It starts with one "quick" API call, and ends with you rewriting your stack to survive.

The antidote? Build for freedom from day one.

Freedom to experiment
Freedom to negotiate
Freedom to pivot
Freedom to scale

Because the best AI systems aren't the ones that just work today — they're the ones that will still work tomorrow, next year, and in the next wave of change.

Manipulating time like a TimeLord with Flux

Gyasi Sutton, MD, MPH — Fri, 27 Jun 2025 09:06:47 GMT

Testing is a cornerstone of robust software development, but it presents unique challenges when time is a critical factor. How do you verify that a notification is sent exactly 24 hours after an event without making your test suite wait for an entire day? How do you ensure that time-sensitive logic is immune to the small, unpredictable delays of real-world execution? Mocking Python's built-in time module can quickly become a tangled mess of patched calls and fragile tests.

This is precisely where Flux comes in. Flux is a Python library designed to give you complete and deterministic control over time in your tests. It allows you to create virtual timelines, making it trivial to fast-forward, freeze, and schedule events without the wait. In this article, we'll explore how Flux works, dive into real-world use cases, and walk through how to replace unwieldy, time-based tests with elegant, fast, and reliable ones.

A Bird’s-Eye View of Flux

Flux is all about giving you full control over time in your Python tests and simulations. In a nutshell, it:

Provides a virtual clock for your code to run against.
Allows you to schedule callbacks to fire at specific points in the virtual timeline.
Supports time factors, letting you speed up or slow down how quickly the virtual timeline progresses relative to real-world time.
Optionally offers a global “current timeline” that can transparently replace calls to time.sleep() or time.time() across different modules.

This means you can test things like “what if my function runs for a whole day?” without waiting 24 real hours. You can also freeze time for deterministic tests or accelerate time to watch processes play out in a fraction of their usual duration.

Real-World Use Cases

Here are just a few places where Flux can make your life easier:

Unit Testing Long Waits

Code that triggers an alert after an hour/day/week can be tested immediately by fast-forwarding the virtual clock.

Accelerating Simulations

Got a simulation that’s meant to run for days? Crank up the “time factor” so an hour of virtual time passes in just seconds of real time.

Freezing Time for Deterministic Tests

By setting the time factor to zero (or using freeze()), your time-based tests can make exact comparisons without worrying about the overhead of Python or other system delays.

Scheduling Automated Callbacks

Want a function to run automatically once the clock hits a certain timestamp? Flux’s scheduling mechanism has you covered.

Seamless Global Integration

With Flux, you can replace all time module calls in your entire project (or specific modules) with a single global timeline, making your application’s notion of time fully under your control.

Hello, Flux: Your First Tutorial

1. Installation

If Flux is on PyPI or you have it locally, install it with:

pip install flux

(Or adjust accordingly if you have a different installation process.)

2. Meet `Timeline`

Flux offers a class called Timeline. This is the hero of the story—an object that represents a virtual clock. Let’s see it in action:


from flux import Timeline

# Create a timeline instance
timeline = Timeline()

# Get the current virtual time
print(f"Current virtual time: {timeline.time()}")

At creation, the timeline starts at a default epoch (similar to the real time’s epoch). You can query it just like time.time() in standard Python.

3. Sleeping Virtually

Timeline.sleep() behaves similarly to time.sleep(), but it advances virtual time:


print(f"Before sleep: {timeline.time()}")

# Sleep 10 virtual seconds
timeline.sleep(10)

print(f"After sleep: {timeline.time()}")

By default, s>time factor = 1, so sleeping 10 virtual seconds takes 10 real seconds—unless you change the rules.

4. Changing the Time Factor

Think of this like the classic sci-fi or fantasy trope where time moves differently in another dimension—five years in Narnia might be only two minutes in the real world. The time factor in Flux gives you that exact power over your code's timeline. It determines how many virtual seconds pass per real second. For example:


timeline.set_time_factor(5)
print(f"Time factor: {timeline.get_time_factor()}")

# Now, sleeping 2 virtual seconds will only take 0.4 real seconds.
timeline.sleep(2)

This is a game-changer if you want to speed up or slow down time-dependent logic.

5. Freezing Time

Sometimes, you don’t want the timeline to progress automatically at all. This is where freezing comes in. When time is frozen (by setting the time factor to 0), timeline.sleep() is the only way to advance the virtual clock, and it does so instantly with no real-world delay.


timeline.freeze()  # Sets time factor to 0
print(f"Frozen time: {timeline.time()}")

# Sleeping will instantly advance the virtual clock with no real waiting
timeline.sleep(30)

print(f"Time after 'sleeping' on frozen timeline: {timeline.time()}")

Perfect for avoiding flaky tests caused by unpredictable real-world time offsets.

6. Scheduling a Callback

Timeline.schedule_callback(when, callback_function) allows you to set a future point in the virtual timeline to run a function. If you need to wait until all scheduled callbacks have triggered, the library provides a convenient sleep_wait_all_scheduled() method.


def say_hello():
    print(f"Hello at virtual time: {timeline.time()}")

timeline.schedule_callback(timeline.time() + 50, say_hello)

timeline.sleep(60)  # The callback fires after we've passed the scheduled time

Seven Practical Code Snippets

Let’s walk through how Flux shines in typical situations you might face.

1. Testing a Long-Running Function

Ever needed to test that a function notifies someone after a day’s wait? With Flux, no problem:


import time

def long_running_func(_sleep=time.sleep, _time=time.time):
    start = _time()
    while True:
        if _time() - start > 60 * 60 * 24:  # 24 hours
            print("Notification triggered!")
            break
        _sleep(30)

# Test using a virtual timeline
from flux import Timeline

def test_long_wait():
    timeline = Timeline()
    timeline.set_time_factor(0)  # freeze time to skip real waits

    # After 24 hours + 1 second in virtual time, we expect a notification
    timeline.schedule_callback((60 * 60 * 24) + 1, lambda: None)

    # Run the function with the timeline's time and sleep
    long_running_func(_sleep=timeline.sleep, _time=timeline.time)

    print("Test passed without waiting 24 real hours!")

Notice how we never truly wait 24 hours in real-time. The virtual clock leaps to the future instantly.

2. Speeding Up a Simulation

Simulations often need to “hurry up” in test or demo mode. Enter time factor:


def simulate(duration):
    print("Simulation started.")
    time.sleep(duration)
    print("Simulation ended.")

from flux import Timeline, current_timeline

def test_simulation_speed():
    # Use the global timeline for convenience
    current_timeline.set(Timeline())
    current_timeline.set_time_factor(1000)

    # If we call simulate(3600), it should take around 3.6s in real time
    simulate(3600)

    print("Sim finished quickly!")

3. Scheduling Periodic Tasks

Want to “ping” a sensor every 10 virtual seconds, three times in total?


def sensor_ping():
    print("Sensor ping at virtual time")

timeline = Timeline()
timeline.set_time_factor(1)  # Normal speed (just for illustration)

# Schedule pings
for i in range(3):
    timeline.schedule_callback(timeline.time() + 10*(i+1), sensor_ping)

# Sleep long enough for all pings
timeline.sleep(40)

print("All sensor pings done!")

4. Freezing Time for Precise Assertions

When time is unfrozen, your code typically runs a few milliseconds slower or faster than you expect. Flux can remove that uncertainty:


timeline = Timeline()
timeline.freeze()

start = timeline.time()
timeline.sleep(10)
end = timeline.time()

# Perfectly deterministic test
assert (end - start) == 10, "Time delta should be exactly 10!"

print("Freeze test successful!")

5. Using the Global Timeline Across Modules

In a large project, you might want multiple modules to share the same timeline. The current_timeline proxy is perfect for this. It allows multiple modules to share the same virtual clock without having to pass the Timeline object around. Here’s the pattern:


# moduleA.py
try:
    from flux import current_timeline as time
except ImportError:
    import time # fallback in case flux is not installed

def do_something_after_a_while():
    time.sleep(100)
    print("Done!")

# test_moduleA.py
from flux import Timeline, current_timeline
import moduleA

tl = Timeline()
tl.freeze() # Instantly advance time
current_timeline.set(tl)

moduleA.do_something_after_a_while()
# "Done!" will print immediately

6. Simulating a 5-Year Stock Prediction Algorithm

The true power of Flux shines when testing complex systems where the outcome isn’t predictable by a simple formula. Imagine you've built a sophisticated stock prediction algorithm. You need to test how it performs over five years of simulated market data, a process that would be impossible to run in real-time. With Flux, you can validate its long-term behavior in minutes.


import time
import random

def run_prediction_algo(market_data, duration_days):
    """
    A complex function that simulates running a prediction algorithm.
    The logic here would be your proprietary model.
    """
    print("Starting 5-year market prediction simulation...")
    for day in range(duration_days):
        # Simulate complex daily processing
        # e.g., fetching data, running models, making trades
        _ = [random.random() ** 2 for _ in range(1000)] 
        time.sleep(86400) # Advance one virtual day

    print("Simulation complete.")
    return "final_portfolio_value"

# --- Test File ---
from flux import Timeline, current_timeline

def test_five_year_simulation():
    # Set up a global timeline that is frozen to run instantly
    tl = Timeline()
    tl.freeze()
    current_timeline.set(tl)

    five_years_in_days = 365 * 5
    
    # Run the five-year simulation. Since the timeline is frozen,
    # the 1825 calls to sleep(86400) happen instantly.
    result = run_prediction_algo(
        market_data="mock_data_source",
        duration_days=five_years_in_days
    )

    # The entire 5-year test completes in a fraction of a real second
    print(f"Test finished. Result from 5-year simulation: {result}")
    assert result == "final_portfolio_value"

test_five_year_simulation()

7. Testing a Cron-Like Scheduler

How do you test a function that's supposed to run on a complex schedule, like "every second Friday of the month"? Writing a test that waits for real Fridays to pass is not an option. This is a perfect use case for Flux, pairing it with Python's `datetime` module to check the logic of the scheduler.


import datetime
import time

# --- The Function to Test ---
def cleanup_job_runner(get_time):
    """
    A runner that executes a cleanup job, but only on the 2nd Friday of any month.
    """
    last_run_day = -1
    jobs_fired = 0
    
    # Run for a simulated year
    for _ in range(365):
        now = datetime.datetime.fromtimestamp(get_time())
        
        # Prevent running multiple times on the same day
        if now.day == last_run_day:
            time.sleep(86400) # Move to the next day
            continue

        # Logic: Is it Friday? And is it in the second week of the month?
        is_friday = (now.weekday() == 4)
        is_second_week = (now.day > 7 and now.day <= 14)

        if is_friday and is_second_week:
            print(f"Job fired on {now.strftime('%Y-%m-%d')}, the 2nd Friday.")
            jobs_fired += 1
        
        last_run_day = now.day
        time.sleep(86400) # Move to the next virtual day
        
    return jobs_fired


# --- The Test File ---
from flux import Timeline, current_timeline

def test_cleanup_job_fires_correctly_over_one_year():
    # Start on a known date: Jan 1, 2024 (a Monday)
    tl = Timeline(start_time=1704067200)
    tl.freeze()
    current_timeline.set(tl)

    # Run the scheduler function over a virtual year
    fire_count = cleanup_job_runner(get_time=current_timeline.time)

    print(f"Test complete. The job fired {fire_count} times in a virtual year.")
    
    # There are 12 months, so the job should fire 12 times in a year.
    assert fire_count == 12

test_cleanup_job_fires_correctly_over_one_year()

A Data Engineer's Perspective: Timezones and Data Synchronization

As a data engineer, one of the most persistent challenges is managing time across distributed systems and diverse data sources. While Flux brilliantly solves the problem of testing time-dependent logic within an application, the real world often introduces a far more chaotic variable: timezones.

Different databases, APIs, and services often have their own idea of what 'now' is. Cloud-based providers are notorious for this, frequently defaulting to regional timestamps (like PST or EST) depending on where a server is located. This can lead to maddening bugs where data seems to arrive out of order or disappears entirely, simply because timestamps aren't being compared on a level playing field.

This is why it is an iron-clad best practice to standardize all time data to Coordinated Universal Time (UTC). By converting every timestamp to UTC as early as possible in your data pipeline, you create a single source of truth. This practice eliminates ambiguity and ensures that when you compare timestamps from a database in Virginia and a log file from Singapore, you're comparing apples to apples. While Flux helps you control time inside your tests, a strict UTC-first policy will help you tame it in your production systems.

Final Thoughts

Flux transforms the way you handle time-based tests and simulations. By mocking time progression, you sidestep the messy patching of Python’s time module in scattered places. You can freeze time, fast-forward through days in a blink, or schedule future events with simple, expressive code.

Key Takeaways

Easy Setup: A single Timeline object does everything you need.
Full Control: Adjust the time factor, freeze time, or schedule callbacks.
Save Time: Test day-long processes in mere seconds (or instantly!).
Reduce Flakes: Eliminate the uncertainty of real-world timing.
Seamless Integration: Use the global timeline to unify time usage across your codebase.

With Flux, you’ll find testing long or complex time-based scenarios becomes almost trivial—no more waiting overnight for tests to pass, and no more dealing with frustrating “close enough” validations in your assertions.

So the next time you have a feature that waits hours or days to do something, consider Flux. Your future self (and your test suite) will thank you.

SQL Alchemy for pythonic pipelines

Gyasi Sutton, MD, MPH — Sat, 11 Jan 2025 20:25:03 GMT

I've primarily been developing code, benchmarks, and data tables in Python, as platforms like Snowflake often present quirks and limitations in data processing that make it challenging to produce the desired deliverables. However, with the recent addition of a new Data Director, I've been focusing more on writing SQL queries rather than relying on my usual NumPy and Pandas stack. This shift was motivated by the desire to make the processing code easier to share and understand. Given the complexity of a major project I'm working on, which involves the details I'm outlining here, I feel the need to revisit SQLAlchemy from a beginner's perspective.

What is SQL Alchemy

SQL Alchemy is a Python library that interacts with relational databases using a high-level Pythonic approach. It provides two main components:

Core: A low-level component that offers SQL abstraction and execution.
ORM (Object Relational Mapper): A higher-level abstraction that maps Python classes to database tables, enabling developers to interact with the database in terms of objects rather than raw SQL queries.

SQL Alchemy is one of the most widely used database libraries in Python because of its flexibility, performance, and support for multiple database systems (e.g., SQLite, PostgreSQL, MySQL, Oracle).

Advantages of SQL Alchemy

Database Agnosticism:
SQLAlchemy supports multiple database engines, so you can easily switch between databases (e.g., SQLite, MySQL, PostgreSQL) by changing the connection string.
ORM Abstraction:
SQLAlchemy’s ORM lets you work with Python objects instead of writing raw SQL, making the code cleaner and easier to maintain.
SQLAlchemy Core:
Provides flexibility for advanced users to write custom SQL queries if the ORM does not fit a particular use case.
Ease of Use with Relationships:
SQLAlchemy makes defining and managing relationships between tables intuitive through Python objects.
Query Composition:
You can build queries dynamically using Python expressions, making complex query construction more manageable.
Better Security:
It handles SQL injection prevention automatically by using parameterized queries.
Scalability and Performance:
SQLAlchemy uses connection pooling and efficient query generation to work well in high-performance scenarios.

Differences Between SQLAlchemy and Straight SQL

Feature	SQLAlchemy	Straight SQL
Abstraction	Provides Pythonic abstraction over SQL, especially with the ORM.	Requires writing raw SQL queries for all interactions.
Database Agnosticism	Supports multiple database engines without changing code structure.	SQL queries are typically specific to a database engine.
Ease of Relationships	Relationships between tables are handled with Python objects and defined declaratively.	Relationships must be handled explicitly in SQL.
Code Readability	Queries are more readable and integrated into Python's syntax.	SQL queries are separate and less Pythonic.
Complex Query Building	Queries can be built dynamically using Python.	Queries must be written explicitly in SQL.
Security	Automatically escapes parameters to prevent SQL injection.	Requires manual handling to prevent SQL injection.
Performance Tuning	Supports features like connection pooling and lazy loading.	Performance tuning requires manual effort and expertise.

Examples: SQLAlchemy vs. Straight SQL

Example 1: Fetching Data

Straight SQL

import sqlite3

connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# Execute a raw SQL query
cursor.execute("SELECT * FROM users WHERE age > ?", (25,))
results = cursor.fetchall()

for row in results:
    print(row)

connection.close()

SQLAlchemy

from sqlalchemy import create_engine, Table, MetaData
from sqlalchemy.orm import sessionmaker

# Create database engine and session
engine = create_engine('sqlite:///example.db')
Session = sessionmaker(bind=engine)
session = Session()

# ORM Example
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)

# Query using ORM
results = session.query(User).filter(User.age > 25).all()
for user in results:
    print(user.name, user.age)

# Alternatively, using Core
metadata = MetaData(bind=engine)
users_table = Table('users', metadata, autoload_with=engine)
query = users_table.select().where(users_table.c.age > 25)
connection = engine.connect()
results = connection.execute(query)

for row in results:
    print(row)

Example 2: Inserting Data

Straight SQL

connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# Insert a record
cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", ('Alice', 30))
connection.commit()
connection.close()

SQLAlchemy

# ORM Example
new_user = User(name='Alice', age=30)
session.add(new_user)
session.commit()

# Alternatively, using Core
insert_query = users_table.insert().values(name='Alice', age=30)
connection = engine.connect()
connection.execute(insert_query)

Example 3: Relationships Between Tables

Straight SQL

-- Example SQL
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, description TEXT,
                     FOREIGN KEY(user_id) REFERENCES users(id));

SELECT users.name, orders.description
FROM users
JOIN orders ON users.id = orders.user_id;

SQLAlchemy

from sqlalchemy import ForeignKey
from sqlalchemy.orm import relationship

class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    description = Column(String)
    user = relationship("User", back_populates="orders")

User.orders = relationship("Order", order_by=Order.id, back_populates="user")

# Querying
results = session.query(User).join(User.orders).filter(Order.description == 'Some order').all()
for user in results:
    print(user.name)

When to Use SQLAlchemy vs. Straight SQL

Use SQLAlchemy:

When you need portability across different databases.
When the project requires object-oriented, maintainable code.
When you’re building complex applications with relationships.

Use Straight SQL:

When performance is critical, you can optimize raw queries.
For simple scripts or one-off database operations.
When you need to execute non-standard SQL specific to the database engine.

SQLAlchemy provides a balance between flexibility and abstraction, making it an excellent choice for most Python projects that involve databases.

Two different paths to SQLAlchemy

SQLAlchemy provides two distinct ways to interact with a database: ORM (Object Relational Mapper) using classes and Core using tables. Both approaches are powerful and suited to different use cases. Here’s an explanation of the differences between declaring classes in SQLAlchemy ORM and using the Core approach, with examples to clarify.

1. Declaring Classes (SQLAlchemy ORM)

When using SQLAlchemy ORM, you define database tables as Python classes. These classes are mapped to database tables whose attributes represent the table’s columns.

How It Works:

Classes are declared using the declarative_base() class or the Declarative Base system.
Each class represents a table in the database.
Relationships between tables are defined using Python attributes and ForeignKey or relationship.

Advantages:

Object-Oriented Approach: Work directly with Python objects, which is more intuitive for many developers.
Ease of Use: Queries and relationships are handled in terms of objects, simplifying code.
Abstraction: Abstracts away SQL queries, making it easier to focus on business logic.

Example:

from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

# Declare a base class
Base = declarative_base()

# Define User and Order classes (tables)
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

    # Relationship to orders
    orders = relationship('Order', back_populates='user')


class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    description = Column(String)

    user = relationship('User', back_populates='orders')

# Create tables and add data
engine = create_engine('sqlite:///example.db')
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
session = Session()

new_user = User(name='Alice', orders=[Order(description='Order 1')])
session.add(new_user)
session.commit()

# Querying
user = session.query(User).filter_by(name='Alice').first()
print(user.orders[0].description)  # Output: Order 1

2. Using Core (Table-Based)

In SQLAlchemy Core, you define tables explicitly using the Table class. Instead of objects, you work with raw SQL expressions and the table objects directly.

How It Works:

Tables are declared using the Table class.
SQL queries are built using SQLAlchemy’s expression language.
No direct Python object representation exists for database rows; results are just dictionaries or tuples.

Advantages:

Flexibility: Gives you more control over raw SQL queries.
Performance: No ORM overhead; you work directly with SQL and table definitions.
Minimalism: Useful for simple scripts or scenarios where an ORM is unnecessary.

Example:

from sqlalchemy import Table, Column, Integer, String, ForeignKey, MetaData, create_engine

# Define tables
metadata = MetaData()

users = Table(
    'users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String)
)

orders = Table(
    'orders', metadata,
    Column('id', Integer, primary_key=True),
    Column('user_id', Integer, ForeignKey('users.id')),
    Column('description', String)
)

# Create tables and add data
engine = create_engine('sqlite:///example.db')
metadata.create_all(engine)

# Insert data
with engine.connect() as conn:
    conn.execute(users.insert().values(name='Alice'))
    conn.execute(orders.insert().values(user_id=1, description='Order 1'))

# Querying
with engine.connect() as conn:
    result = conn.execute(users.select().where(users.c.name == 'Alice'))
    for row in result:
        print(row)  # Output: (1, 'Alice')

Key Differences Between Declaring Classes (ORM) and Core

Aspect	Declaring Classes (ORM)	Core (Table-Based)
Representation	Python classes represent database tables.	Explicit `Table` objects represent database tables.
Data Access	Returns objects with attributes (e.g., `user.name`).	Returns rows as dictionaries or tuples (e.g., `row['name']`).
Relationships	Handled via `relationship()` and `ForeignKey`.	Handled explicitly via joins in SQL.
Query Syntax	High-level, object-oriented queries (e.g., `session.query(User).filter(...)`).	Low-level, SQL-like queries (e.g., `users.select().where(...)`).
Flexibility	Abstracts SQL; better for simpler or object-heavy use cases.	Offers fine-grained control for custom or complex queries.
Performance	Slightly slower due to ORM overhead.	Faster due to the absence of an ORM layer.
Learning Curve	Easier for Python developers, but adds abstraction.	Requires more knowledge of SQL and the Core API.
Use Case	Complex applications with many relationships.	Simple scripts or scenarios requiring precise SQL control.

When to Use Which?

Use ORM (Classes):

When your application involves complex relationships between tables.
When working with Python objects fits naturally with the problem domain.
When you want to reduce boilerplate and make your code more Pythonic.

Use Core:

When you need fine-grained control over SQL queries.
For performance-critical scenarios where ORM overhead is unacceptable.
When you need flexibility to execute complex or database-specific queries.

Blending ORM and Core

You can also mix the two approaches in SQLAlchemy. For example, you might define tables using Core but use ORM for parts of the application.

# Define a table using Core
users = Table(
    'users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String)
)

# Map the table to an ORM class
from sqlalchemy.orm import mapper

class User:
    pass

mapper(User, users)

# Now you can use the ORM with the mapped class
session.add(User(name='Alice'))

Other Industry Standard Data Engines

Using less common data engines like Snowflake or other database systems with SQLAlchemy requires leveraging its database-agnostic architecture. SQLAlchemy supports a wide range of database engines through dialects, which are plugins that enable communication with specific databases. Snowflake, for instance, has a dedicated SQLAlchemy dialect.

Here’s how you can use SQLAlchemy with non-common databases like Snowflake or others, including any nuances or key considerations.

1. Snowflake with SQLAlchemy

Setup

To connect SQLAlchemy to Snowflake, you need the Snowflake SQLAlchemy dialect. Install the required libraries:

pip install snowflake-sqlalchemy

Connection Setup

Here’s how you can connect SQLAlchemy to a Snowflake database:

from sqlalchemy import create_engine

# Snowflake connection string
engine = create_engine(
    'snowflake://{user}:{password}@{account}/{database}/{schema}'.format(
        user='YOUR_USER',
        password='YOUR_PASSWORD',
        account='YOUR_ACCOUNT',
        database='YOUR_DATABASE',
        schema='YOUR_SCHEMA'
    )
)

# Test connection
connection = engine.connect()
result = connection.execute("SELECT CURRENT_VERSION()")
for row in result:
    print(row)
connection.close()

Using ORM with Snowflake

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

# Define a table using ORM
class Employee(Base):
    __tablename__ = 'employees'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    salary = Column(Integer)

# Bind engine and create tables
Base.metadata.create_all(engine)

# Insert data
Session = sessionmaker(bind=engine)
session = Session()
new_employee = Employee(name="Alice", salary=100000)
session.add(new_employee)
session.commit()

# Query data
employees = session.query(Employee).all()
for employee in employees:
    print(employee.name, employee.salary)

Using Core with Snowflake

from sqlalchemy import Table, MetaData, Column, Integer, String

metadata = MetaData()

# Define the table
employees = Table(
    'employees', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('salary', Integer)
)

# Create the table in Snowflake
metadata.create_all(engine)

# Insert data
with engine.connect() as conn:
    conn.execute(employees.insert().values(name='Alice', salary=100000))

# Query data
with engine.connect() as conn:
    results = conn.execute(employees.select())
    for row in results:
        print(row)

2. General Approach for Other Databases

For other databases, SQLAlchemy relies on a specific dialect. Here's how to adapt to various data engines:

Steps:

Install the Dialect for the Database:
Search for a SQLAlchemy-compatible dialect for your database. Many major databases have official dialects, while others have community-maintained ones.

Examples:

MongoDB: pip install sqlalchemy-mongo
ClickHouse: pip install clickhouse-sqlalchemy
BigQuery: pip install pybigquery

Find the Connection String Format:
Check the dialect documentation for the correct connection string format. Each database has unique requirements, such as authentication details or specific parameters.
Use SQLAlchemy's Core or ORM:
Use the same create_engine() syntax and define tables or ORM classes as needed.

Examples of Other Databases

MongoDB (with SQLAlchemy-mongo)

While MongoDB is a NoSQL database, some projects provide an SQL-like interface for MongoDB.

from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer

# Install MongoDB dialect first: pip install sqlalchemy-mongo
engine = create_engine('mongodb://localhost:27017/mydatabase')
metadata = MetaData()

# Define a table-like structure
users = Table(
    'users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('age', Integer)
)

# Insert and query data
with engine.connect() as conn:
    conn.execute(users.insert().values(id=1, name='Alice', age=30))
    result = conn.execute(users.select())
    for row in result:
        print(row)

ClickHouse

ClickHouse is a fast columnar database for analytical queries.

from sqlalchemy import create_engine, Table, MetaData, Column, Integer, String

# Install ClickHouse dialect first: pip install clickhouse-sqlalchemy
engine = create_engine('clickhouse://default:@localhost/test')
metadata = MetaData()

# Define a table
users = Table(
    'users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('age', Integer)
)

# Create the table in ClickHouse
metadata.create_all(engine)

# Insert and query data
with engine.connect() as conn:
    conn.execute(users.insert().values(id=1, name='Alice', age=30))
    results = conn.execute(users.select())
    for row in results:
        print(row)

Google BigQuery

BigQuery works with SQLAlchemy via the pybigquery dialect.

from sqlalchemy import create_engine

# Install BigQuery dialect: pip install pybigquery
engine = create_engine('bigquery://project_id/dataset')

# Query BigQuery
with engine.connect() as conn:
    result = conn.execute("SELECT * FROM my_table LIMIT 10")
    for row in result:
        print(row)

Key Considerations for Non-Common Data Engines

Install the Correct Dialect:
Ensure the dialect matches your database engine. For some niche engines, the dialect might not be actively maintained, so check compatibility with SQLAlchemy’s current version.
Understand Database-Specific Features:
Some databases (e.g., Snowflake, ClickHouse) have unique capabilities like columnar storage or serverless queries. SQLAlchemy’s Core can often handle these features better than ORM.
Performance Tuning:
For large-scale or analytical databases (e.g., Snowflake, BigQuery), batch inserts and asynchronous connections (using libraries like asyncio) can improve performance.
Custom SQL:
For specialized databases, you might need to write raw SQL queries when SQLAlchemy abstractions fall short. Use SQLAlchemy’s text() for this purpose.

Fallback Raw SQL Examples in SQLAlchemy

When working with non-common data engines, certain scenarios may require raw SQL queries to leverage database-specific features or handle operations that SQLAlchemy’s abstractions don’t natively support. SQLAlchemy allows you to execute raw SQL through the text() function, which provides full flexibility to write and execute custom queries.

Here are two examples of using raw SQL with SQLAlchemy:

Example 1: Custom Query for Analytical Databases (e.g., Snowflake, BigQuery)

In analytical databases, you might need to use specialized SQL features, such as window functions or partitioning, that are not straightforward to implement with SQLAlchemy’s ORM or Core.

from sqlalchemy import create_engine, text

# Connect to the database
engine = create_engine(
    'snowflake://{user}:{password}@{account}/{database}/{schema}'.format(
        user='YOUR_USER',
        password='YOUR_PASSWORD',
        account='YOUR_ACCOUNT',
        database='YOUR_DATABASE',
        schema='YOUR_SCHEMA'
    )
)

# Write and execute a custom SQL query
query = text("""
    SELECT 
        department, 
        COUNT(employee_id) AS total_employees,
        AVG(salary) AS avg_salary
    FROM employees
    WHERE hire_date >= :start_date
    GROUP BY department
    ORDER BY avg_salary DESC
""")

# Execute the query and pass parameters
with engine.connect() as conn:
    result = conn.execute(query, {"start_date": "2022-01-01"})
    for row in result:
        print(row)

Example 2: Bulk Data Insertion Using `text()`

For databases optimized for high-volume data processing, such as ClickHouse or Snowflake, you might want to perform bulk inserts directly using raw SQL.

from sqlalchemy import create_engine, text

# Connect to the database
engine = create_engine('clickhouse://default:@localhost/test')

# Define raw SQL for bulk insertion
bulk_insert_sql = text("""
    INSERT INTO users (id, name, age)
    VALUES 
        (:id1, :name1, :age1),
        (:id2, :name2, :age2),
        (:id3, :name3, :age3)
""")

# Data to be inserted
data = {
    "id1": 1, "name1": "Alice", "age1": 30,
    "id2": 2, "name2": "Bob", "age2": 25,
    "id3": 3, "name3": "Charlie", "age3": 35
}

# Execute the raw SQL
with engine.connect() as conn:
    conn.execute(bulk_insert_sql, data)

# Verify the insertion
select_query = text("SELECT * FROM users")
with engine.connect() as conn:
    result = conn.execute(select_query)
    for row in result:
        print(row)

Using text() with raw SQL gives you complete control when SQLAlchemy abstractions are not enough, while still benefiting from connection pooling and parameterized queries for security and performance.. Of note raw queries can utilize native features of the data engines--for example--time travel in Snowflake

Conclusion

As we've explored, SQLAlchemy is a versatile tool that offers a Pythonic approach to managing databases, providing both ORM and Core interfaces for different use cases. While its abstractions make working with relational databases cleaner and more intuitive, certain scenarios, especially when working with less common data engines like Snowflake, ClickHouse, or BigQuery, may still require fallback raw SQL queries to unlock engine-specific features.

Use Cases May Vary:

If your goal is to avoid database-specific errors and ensure portability, SQLAlchemy's ORM and Core provide excellent abstraction layers, making it easier to share and maintain code.
On the other hand, when you want to lock down pipeline logic and leverage the full power of a specific database's unique features (like Snowflake's time travel or BigQuery's partitioned tables), raw SQL combined with SQLAlchemy's text() function offers the flexibility you need.

A Primer for Experimentation:
If you’re new to SQLAlchemy, think of this article as a primer to help you get started. Dive into the ORM to handle objects and relationships intuitively, experiment with Core for raw control, and don't hesitate to mix in text() queries when needed. The flexibility of SQLAlchemy allows you to balance simplicity with power, scaling up as your needs evolve.

Ultimately, SQLAlchemy is a foundation for building robust, maintainable pipelines in Python. Its ability to combine high-level abstractions with low-level control makes it a critical tool for any data engineer or developer. Whether you're optimizing for performance or building cross-database solutions, SQLAlchemy’s features and flexibility ensure you're well-equipped for the task.

So, keep experimenting, refining your workflows, and finding new ways to streamline your database interactions. The journey of mastering SQLAlchemy is as rewarding as the pipelines it powers. Hey everybody! Keep learning!

Creating a USMLE-style question-and-answer generator

Gyasi Sutton, MD, MPH — Fri, 08 Nov 2024 23:00:00 GMT

We are going to start a project to use current technology to vastly increase our knowledge retention. We will do this by teaching ourselves, not ChatGPT, how to score high on a USMLE exam. We will do this by first defining a prompt and agent that will output a single best answer mcq. Once we get that working, we can move on to things like RAG and think of ways to tweak output numbers and difficulty.

The United States Medical Licensing Examination (USMLE) is a three-step examination for medical licensure in the United States. It assesses a physician's ability to apply knowledge, concepts, and principles, and to demonstrate fundamental patient-centered skills, that are important in health and disease and constitute the basis of safe and effective patient care. The exam is sponsored by the Federation of State Medical Boards (FSMB) and the National Board of Medical Examiners (NBME).

Collect the Research

Now let's gather all the information we can about the layout and structure of the questions, Perplexity, Phind, or ChatGpt works well for this step.

Multiple-Choice Questions (MCQs)

Components of MCQs:

Stem: The clinical vignette or lead-in question that provides context.
Lead-in: The direct question posed to the examinee.
Options: A set of possible answers, typically five choices (A to E), where one is the best or most correct answer.

Distractors:

Distractors are incorrect options provided in multiple-choice questions designed to mislead or challenge the examinee. They are plausible enough to be considered but incorrect.
Purpose: They test the depth of the examinee’s knowledge and ability to discern the correct answer from similar but incorrect choices.

Difficulty Levels:

USMLE questions range from basic to highly complex. The difficulty is determined by factors such as the integration of multiple concepts, the specificity of the clinical scenario, and the level of critical thinking required.
Questions can be categorized into three levels:
Low Difficulty: Requires recall of basic facts and straightforward application of knowledge.
Moderate Difficulty: Involves understanding and applying multiple concepts to a clinical scenario.
High Difficulty: Requires synthesis of information, advanced clinical reasoning, and decision-making skills.

Measurement of Difficulty:

Item Response Theory (IRT): This statistical method is used to calibrate question difficulty and discriminate between different levels of examinee ability.
Parameters: Difficulty (b-parameter), discrimination (a-parameter), and guessing (c-parameter).
Calibration: Based on examinee'responses, the difficulty of each question is adjusted to ensure accurate measurement of ability.

Example Deep Dive on Question Difficulty

Basic Question:

Stem: A 24-year-old female presents with a sore throat and fever. Physical examination reveals pharyngeal erythema and exudates.
Lead-in: What is the most likely diagnosis?
Options:
A) Streptococcal pharyngitis (correct)
B) Viral pharyngitis
C) Mononucleosis
D) Allergic rhinitis
E) GERD

Moderate Difficulty Question:

Stem: A 50-year-old male with a history of chronic obstructive pulmonary disease (COPD) presents with increased dyspnea and productive cough. His temperature is 38.3°C, and he has coarse breath sounds with wheezing.
Lead-in: What is the best initial treatment?
Options:
A) Antibiotics (correct)
B) Inhaled corticosteroids
C) Beta-blockers
D) Diuretics
E) Antihistamines

High Difficulty Question:

Stem: A 68-year-old female with a history of diabetes and hypertension presents with sudden onset of right-sided weakness and difficulty speaking. She was last seen normal 3 hours ago. Her blood pressure is 180/100 mmHg, and CT scan shows no hemorrhage.
Lead-in: What is the next best step in management?
Options:
A) IV thrombolytics (correct)
B) Aspirin
C) Clopidogrel
D) Heparin
E) Blood pressure control

All of this content took 4 minutes with prompting. I knew exactly what I was looking for and could verify most of the returned information. Ok now that we've learned all have all that info we know that we need certain instructions in the output, namely the stem, lead-in options, but I also want to add some extra step. We also need the reasoning about why each answer is right or wrong and for the right answer we want a deeper more detailed reason about why this answer is right compared to others. These will help guide the structure that we need. Thinking about getting this output to conform to a json structure now will speed up frontend development at a later point. We will save the difficulty problem for another tutorial.

Prompt Engineering

First let me plug two exciting technologies, that I hope to get to in another consumer. Fabric and Dspy. Fabric has made me in and even bigger consumer of digital information as I've use the prompts in pipelines to quickly summarize YouTube videos and research papers and even summarize medium articles. Dspy is an interesting project where the prompt is not needed, and it's abstracted away, but fine tunes that hidden prompt to get better output. I'm underselling this a lot and I will have a tutorial soon about this as well.
Now for the purposes of being transparent I came across this article that had a decent starting prompt template for what I wanted to do.

import os 
import os
from dotenv import load_dotenv
from guidance import models, gen


load_dotenv(
    ".env",
    override=True,
)


def generate_usmle_prompt(subject):

    initial_prompt = f"""
    You are developing a question bank for medical exams focusing on the topic of {subject}. The subject may be broad , so the first thing to thing about is refining the question based on a specific refined subject, like a disease, medication, sign, symptom, marker, chemical process, etc, though noting that it should always come back to a disease process. When creating the answers make sure to create distractors that could be part of a differential diagnosis 

    Generate a high-quality Single Best Answer (SBA) question using the following framework:

    
    Create the question from the stem in parentheses below, while the stem is separated to guide you, the basis introductory part of the question should be the paragraph comprised of the stems
    
    (Stem:
    Patient details (gender/age)
    Presenting complaint (PC)
    Relevant clinical history (HPC)
    Physical examination findings
    Diagnostic test results (optional)
    
    Lead in Question stem: 
    )
    
    
    Answer options:
    [Insert plausible answer option]
    [Insert plausible answer option]
    [Insert plausible answer option]
    [Insert plausible answer option]
    [Insert plausible answer option]

    Explanation:
    For each answer option, provide a brief explanation of why it is correct or incorrect.
    
    Why this is the most appropriate answer:
    (this is an optional section to explain why the correct answer is the best choice in more detail)
    Given the clinical picture and the likely diagnosis of [insert condition], the most probable cause of the patient's symptoms is [insert correct answer option]."""

    lllm = models.LiteLLMCompletion("gpt-4o", echo=False)

    lllm += initial_prompt + gen(name="definition")

    question = lllm["definition"]

    return question


def main():
    subject = input("Enter a medical subject: ")
    print("Generating initial prompt...")
    initial_prompt = generate_usmle_prompt(subject)

    print(initial_prompt)


if __name__ == "__main__":
    main()

If I run the program with the input "serious congenital virus"

**Question:**

A 3-day-old male neonate presents with lethargy, poor feeding, and a fever of 38.5°C. The mother reports that the baby was born at term via vaginal delivery without complications. The mother had no significant prenatal history and tested negative for Group B Streptococcus. On physical examination, the neonate is jaundiced, has hepatosplenomegaly, and exhibits a petechial rash. A complete blood count reveals thrombocytopenia and elevated liver enzymes. 

What is the most likely diagnosis?

A) Cytomegalovirus (CMV) infection
B) Group B Streptococcus (GBS) sepsis
C) Neonatal herpes simplex virus (HSV) infection
D) Toxoplasmosis
E) Rubella

**Explanation:**

A) Cytomegalovirus (CMV) infection
- **Incorrect:** While CMV can cause jaundice, hepatosplenomegaly, and thrombocytopenia, it is less likely to present with a petechial rash and fever in the neonatal period.

B) Group B Streptococcus (GBS) sepsis
- **Incorrect:** GBS sepsis can present with fever and lethargy, but it is less likely to cause jaundice, hepatosplenomegaly, and a petechial rash.

C) Neonatal herpes simplex virus (HSV) infection
- **Correct:** HSV infection in neonates can present with fever, lethargy, poor feeding, jaundice, hepatosplenomegaly, and a petechial rash. The presence of these symptoms along with the elevated liver enzymes and thrombocytopenia makes HSV the most likely diagnosis.

D) Toxoplasmosis
- **Incorrect:** Toxoplasmosis can cause jaundice and hepatosplenomegaly, but it typically presents with other findings such as chorioretinitis and intracranial calcifications, which are not mentioned in this case.

E) Rubella
- **Incorrect:** Congenital rubella syndrome can cause jaundice and hepatosplenomegaly, but it is more commonly associated with cataracts, congenital heart defects, and sensorineural deafness, which are not described in this case.

**Why this is the most appropriate answer:**
Given the clinical picture and the likely diagnosis of neonatal herpes simplex virus (HSV) infection, the most probable cause of the patient's symptoms is HSV infection. The combination of fever, lethargy, poor feeding, jaundice, hepatosplenomegaly, petechial rash, thrombocytopenia, and elevated liver enzymes strongly suggests HSV as the underlying cause.

The output is decent, but not perfect yet. this is with the Gpt 4 model: what would be interesting is to see how other models compare to each other. We will test that in another session but now, I'll just rewrite the code a little to get strict json output. This wa I can properly save the output to a database, in case i want this particularly generated question to pop up again. Running the program again gives me this output:

Enter a medical subject: diabetes
Generating initial prompt...

{
    "question": "A 45-year-old male presents to the clinic with excessive thirst and frequent urination. He reports a history of slowly progressive kidney failure. On physical examination, his blood pressure is 130/85 mmHg, and he appears well-hydrated. Laboratory tests reveal normal blood glucose levels, but a low urine osmolality. Which of the following is the most likely diagnosis?",
    "choices": [
        {
            "choice": "Diabetes Mellitus Type 1",
            "explanation": "Diabetes Mellitus Type 1 is characterized by hyperglycemia due to autoimmune destruction of insulin-producing beta cells. This patient has normal blood glucose levels, making this diagnosis unlikely."
        },
        {
            "choice": "Diabetes Mellitus Type 2",
            "explanation": "Diabetes Mellitus Type 2 involves insulin resistance and is also characterized by hyperglycemia. The patient's normal blood glucose levels do not support this diagnosis."
        },
        {
            "choice": "Diabetes Insipidus",
            "explanation": "Diabetes Insipidus is characterized by excessive thirst and urination due to a deficiency of antidiuretic hormone (ADH) or renal insensitivity to ADH, leading to dilute urine. The patient's symptoms and low urine osmolality are consistent with this condition.",
            "reasoning": "Given the clinical picture and the likely diagnosis of Diabetes Insipidus, the most probable cause of the patient's symptoms is a deficiency or insensitivity to antidiuretic hormone, leading to the production of large volumes of dilute urine.",
            "correct_answer": true
        },
        {
            "choice": "Chronic Kidney Disease",
            "explanation": "Chronic Kidney Disease can cause polyuria, but it is usually associated with other symptoms such as hypertension and electrolyte imbalances. The patient's normal blood pressure and low urine osmolality suggest a different diagnosis."
        },
        {
            "choice": "Primary Polydipsia",
            "explanation": "Primary Polydipsia involves excessive fluid intake leading to dilute urine. However, it is less likely in the context of slowly progressive kidney failure and the specific laboratory findings presented."
        }
    ]
}

This output has the structure I want, though I'm not necessarily satisfied with the length of the clinical vignette for the questions, they should be longer with more details. We can explore expanding this by prompt engineering which we will tackle next time.

In this session, we embarked on an exciting journey to enhance our knowledge retention while preparing for the United States Medical Licensing Examination (USMLE). We began by dissecting the structure and components of multiple-choice questions (MCQs), which are crucial for the exam. We identified the essential elements of MCQs—stems, lead-ins, options, and distractors—along with the varying levels of difficulty that characterize USMLE questions. By using statistical methods such as Item Response Theory (IRT), we aim to calibrate question difficulty and improve our assessment strategies. Our initial efforts yielded a structured JSON output that aligns with our goal of having a MCQ database, though I'm not necessarily satisfied with the length of the clinical vignette for the questions--they should be longer with more details. We can explore expanding this by prompt engineering which we will tackle next time.

Hey everyone, let's keep teaching and learning!

Understanding Gists for Sharing Code

Gyasi Sutton, MD, MPH — Wed, 06 Nov 2024 13:01:00 GMT

Introduction

Sharing code snippets efficiently is crucial for collaboration and learning. GitHub Gists offer a streamlined way to share code or text snippets, making them accessible to anyone with a link. This article explores the concept of Gists, their advantages, and how to manage them effectively.

What are Gists?

Gists are a feature of GitHub that allows users to share small pieces of code or text. They can be public, allowing anyone to view them, or private, restricting access to the creator and those with the link. Gists are ideal for sharing code examples, notes, or scripts quickly and easily.

Advantages of Using Gists

Simplicity: Gists are easy to create and share, requiring only a GitHub account.
Ease of Sharing: A simple URL makes Gists accessible to anyone with the link.
Version Control: Like full GitHub repositories, Gists support version control, enabling users to track changes and revert to previous versions if needed.
Comparison with Full Repositories: While full repositories are suitable for larger projects, Gists are perfect for smaller, standalone snippets.

Creating and Managing Gists with GitHub CLI

The GitHub CLI (gh) provides a powerful way to manage Gists directly from the terminal. Here are some example commands:

Creating a Gist:

gh gist create my_script.sh --public --desc "A simple bash script"

This command creates a public Gist with the contents of my_script.sh and the specified description. For a private Gist, omit the --public flag.

Editing a Gist:

gh gist edit  --add new_file.txt

This command adds a new file to an existing Gist.

Deleting a Gist:

gh gist delete

This command deletes the specified Gist.

Cloning a Gist:

git clone https://gist.github.com/.git

This command clones the Gist repository locally.

Advanced Gist Management with Gistyc

For more robust Gist management, the Gistyc library offers additional functionalities. Gistyc is a Python-based toolkit that allows users to create, update, and delete Gists from the command line or within a Python program. It can be integrated into CI/CD pipelines for automated Gist management.

Creating a Gist with Gistyc:

from gistyc import GISTyc
gist_api = GISTyc(auth_token="YOUR_GITHUB_TOKEN")
response_data = gist_api.create_gist(file_name="my_script.py")

Updating a Gist with Gistyc:

gist_api.update_gist(file_name="my_script.py", gist_id="GIST_ID")

Deleting a Gist with Gistyc:

gist_api.delete_gist(gist_id="GIST_ID")

Listing All Gists:

gist_list = gist_api.get_gists()

Gistyc provides a more programmatic approach to managing Gists. It supports operations like creating, updating, and deleting Gists using a GitHub personal access token. Additionally, Gistyc can handle multiple files and directories, making it ideal for managing large sets of Gists.

Use Cases and Limitations

Gists are commonly used for sharing code snippets in tutorials, documentation, and collaborative projects. However, they have limitations, such as the inability to manage large projects or complex version histories. Users should consider these factors when choosing between Gists and full repositories.

Conclusion

Gists provide a convenient and efficient way to share code snippets, with the added benefits of simplicity and version control. Whether you're collaborating with a team or sharing a quick code example, Gists are a valuable tool in any developer's toolkit. Explore Gists today to enhance your code-sharing capabilities.

Setting up Airflow for ETL development

Gyasi Sutton, MD, MPH — Sun, 03 Nov 2024 10:44:22 GMT

This is part of a larger series of development and setup steps to build a self-contained Medical Record Database using ETL and other technologies. While a lot of what I do here can easily be scripted with pure Python, I am always exploring open-source ways to do things in a structured manner. I hope you find the content useful.

Introduction to Apache Airflow

Apache Airflow is a powerful platform used to programmatically author, schedule, and monitor workflows. It is particularly popular in the data engineering and data science communities for orchestrating complex data workflows, including data ingestion and transformation tasks, such as running dbt (data build tool) models. Airflow allows users to define workflows as code, making it easy to manage, version, and share workflows across teams.

Key Features of Apache Airflow

Directed Acyclic Graphs (DAGs)

At the core of Apache Airflow is the concept of Directed Acyclic Graphs (DAGs). A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Each node in the DAG represents a task, and the edges define the order in which tasks should be executed. DAGs ensure that tasks are executed in a specific sequence without any cycles, which means a task cannot depend on itself, either directly or indirectly.

Simple Pipeline example. Each Node(Task) can be a seperate script processing the previous nodes output

Integration with Various Data Tools

Apache Airflow is designed to be extensible and integrates seamlessly with a wide range of data tools and services. It supports various operators and hooks that allow you to interact with databases, cloud services, and other data processing tools. This makes it easy to create workflows that involve multiple systems, such as extracting data from a database, transforming it using a tool like dbt, and loading it into a data warehouse.

Web Interface for Monitoring

Airflow provides a rich web-based user interface that allows users to monitor and manage workflows. The UI provides a clear view of the DAGs, their status, and the logs of each task. Users can trigger tasks manually, view task dependencies, and even retry failed tasks directly from the interface. This makes it easy to keep track of complex workflows and quickly identify and resolve issues.

Setting the Environment with Docker-compose

Docker-Compose is particularly useful for setting up Apache Airflow in a development or testing environment, as it allows you to define all the necessary services in a single configuration file and start them with a single command. We will download the official download image, edit it, create a dockerfile, and add some plugins, with one simple bash script.

#!/bin/bash

# Define custom ports
WEB_SERVER_PORT=8081
FLOWER_PORT=5556

# Create necessary directories
mkdir -p ./dags ./logs ./plugins

# Set environment variable for Airflow user
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Download the docker-compose.yaml file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.2/docker-compose.yaml'

# Change permissions to make the file writable
chmod u+w docker-compose.yaml

# Modify the docker-compose.yaml file to use custom ports
sed -i "s/8080:8080/${WEB_SERVER_PORT}:8080/g" docker-compose.yaml
sed -i "s/5555:5555/${FLOWER_PORT}:5555/g" docker-compose.yaml

# Uncomment the build line and remove the image line
sed -i '/^\s*image:.*apache\/airflow/d' docker-compose.yaml
sed -i 's/^\s*# build: \./  build: ./' docker-compose.yaml

# Verify the modification
if grep -q '^\s*build: \.$' docker-compose.yaml; then
    echo "Successfully updated docker-compose.yaml with build context."
    sleep 5
else
    echo "Failed to update docker-compose.yaml with build context." >&2
    exit 1
fi
# Create a Dockerfile to install dbt and plugins
cat < Dockerfile
FROM apache/airflow:2.10.2

# Install dbt
RUN pip install dbt-core dbt-postgres

# Install Airflow plugins
RUN pip install apache-airflow-providers-slack
RUN pip install apache-airflow-providers-amazon
RUN pip install apache-airflow-providers-google
RUN pip install apache-airflow-providers-postgres

# Switch to the airflow user
USER airflow
EOF

# Build the custom Docker image
docker-compose build

# Initialize the Airflow database
docker-compose up airflow-init

# Start Airflow services in detached mode
docker-compose up -d

# Output the access information
echo "Airflow is running on http://localhost:${WEB_SERVER_PORT}"
echo "Flower is running on http://localhost:${FLOWER_PORT}"

In essence, the script sets env variables and modifies the incoming docker-compose file, and executes. I have dbt installed along with some other airflow plugins orchestrated in the created Dockerfile, which you can modify to your preferences. I'll share this bash script with this gist, make sure if your environment is Linux, you 'chmod +x ' to make it executable. It goes without saying to have docker and docker-compose installed on your system. Quick point--you can set your port variables which can be needed if you are running a lot of docker-containers. I have many different containers, and I don't want the ports on my host machine conflicting with each other.

A quick explanation of the Docker-Compose file

When setting up Apache Airflow using Docker-Compose, several services are typically included:

Scheduler: The scheduler is responsible for scheduling jobs and ensuring that tasks are executed according to their dependencies and schedules. It continuously monitors the DAGs and triggers tasks when their dependencies are met.
Webserver: The webserver hosts the Airflow web interface, allowing users to interact with the DAGs, view logs, and manage tasks.
Worker: Workers are responsible for executing the tasks defined in the DAGs. In a distributed setup, multiple workers can be used to parallelize task execution and improve performance.
Init Service: The init service is used to initialize the Airflow database and ensure that all necessary configurations are in place before starting the other services.

Running Services

Once you execute the script and there are no errors the echo statements have something like this:

You can now navigate to localhost:8081 and view this:

Feel free to navigate the interface or bash into the container to verify the 'dbt --version'

You will see a lot of example dags.

Recap

In this tutorial, we explored how to set up Apache Airflow. By leveraging Airflow's powerful workflow orchestration capabilities, you can manage complex data processes more effectively and efficiently. We covered the essential features of Airflow, including Directed Acyclic Graphs (DAGs), integration with various data tools, and the intuitive web interface for monitoring workflows.

Additionally, we walked through the process of using Docker-Compose to simplify the environment setup, ensuring that all necessary services are configured and running seamlessly. With the provided bash script, you can easily customize your Airflow instance to fit your specific needs, including adding plugins and managing port configurations to prevent conflicts.

Now that you have a solid foundation for using Apache Airflow, you can begin to implement it in your own data projects. Whether you're automating data ingestion, transformation, or analysis, Airflow provides a structured and scalable solution that can grow with your needs. I hope this tutorial has equipped you with the knowledge and tools to harness the full potential of Apache Airflow in your workflows.

Hey everyone, let's keep teaching and learning!

"Creating and Managing Scalable Application Environments with Traefik and Docker: A Comprehensive Tutorial"

Gyasi Sutton, MD, MPH — Sun, 11 Jun 2023 10:00:00 GMT

Traefik is an open-source reverse proxy and load balancer that enables you to define your own route rules, SSL certificates, and more for your applications. It integrates perfectly with containerized environments such as Docker, Kubernetes, etc.

In this series, we are going to cover how to set up Traefik in a Docker environment, how to route and balance traffic, and how to secure your applications.

Setting Up the Environment

First, make sure that you have Docker and Docker Compose installed on your machine. If not, you can follow the official Docker documentation to install it.

Creating a Traefik Configuration

Let's begin by creating a configuration file for Traefik. We're going to use YAML for this:

Create a file named traefik.yml and add the following content:


entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    watch: true
    exposedByDefault: false

In this configuration, we have defined two entry points: web for HTTP (port 80) and websecure for HTTPS (port 443), comparable to the body's pain pathways. These pathways, much like their biological counterparts, serve unique purposes. The "web" pathway handles standard, non-secure communication, while the "websecure" pathway is designed for encrypted, secure communication. Their functionalities parallel how acute pain warns the body of immediate harm, and chronic pain indicates ongoing issues.

We introduce a Docker provider into this setup, which mirrors the role of the nervous system. This provider monitors changes in the Docker environment, regulating data flow and adapting as needed, akin to how the nervous system modulates and responds to body signals.

We can extend the analogy by comparing these networks to specific regions in the brain, each fulfilling a unique function. Much like the frontal lobe's role in decision making and the occipital lobe's in visual processing, each network caters to a particular type of web service or application.

To ensure precise communication—similar to how nerve signals are directed to the right part of the brain—we implement routing rules in Traefik. These rules guide incoming requests, much as the nervous system routes signals to the appropriate brain regions for processing.

The flexibility of this system is its main strength, allowing us to establish as many networks as necessary. This mirrors how the body manages different kinds of pain signals, directing them through various channels based on their unique characteristics. In doing so, we optimize our system's functionality and responsiveness, leading to an efficient environment for our applications.

Creating the Docker Compose File

Next, create a docker-compose.yml file in the same directory with the following content:

version: '3'

services:
  traefik:
    image: traefik:v2.4
    command:
      - "--api.insecure=true"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./traefik.yml:/etc/traefik/traefik.yml
    networks:
      - web

networks:
  web:
    external: true

Here, we define a service named traefik that uses the Traefik image from Docker Hub. We expose ports 80 (HTTP), 443 (HTTPS), and 8080 (Traefik's dashboard). We mount the Docker socket and our Traefik configuration file into the container.

Starting Traefik

You can start Traefik by running this command:

docker-compose up -d

This command will pull the necessary images and start the Traefik service in the background.

You should now be able to access Traefik's dashboard by going to http://localhost:8080/dashboard.

Creating and Deploying an Example Application

Now, let's create an example application and deploy it with Docker Compose.

Create a new docker-compose.yml file in a different directory with the following content:

version: '3'

services:
  app:
    image: nginx
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.rule=Host(`app.localhost`)"
    networks:
      - web

networks:
  web:
    external: true

This configuration defines an application named app that uses the Nginx image from Docker Hub. The labels given to the service instruct Traefik to create a route rule for this application. It tells Traefik that the application can be accessed by going to http://app.localhost.

Note: In a real-world scenario, you would replace localhost with your domain name.

Now, let's run this application by using the following command:

docker-compose up -d

Testing the Application

If you have configured everything correctly, you should now be able to access your application by going to http://app.localhost in your web browser. Note that you may need to add a mapping in your hosts file if your operating system doesn't automatically resolve *.localhost addresses.

The Nginx start page should be displayed.

Inspecting the Traefik Dashboard

If you go back to the Traefik dashboard (http://localhost:8080/dashboard), you should see your app service listed there. You can inspect the route rules and other configuration details.

This was a basic tutorial to set up Traefik in a Docker environment and deploy a simple application.

In further parts of this series, we will discuss more advanced topics such as:

Load balancing between multiple instances of an application
Adding SSL certificates with Let's Encrypt
Redirecting HTTP traffic to HTTPS
Routing based on path or headers

Conclusion

By using Traefik with Docker, we can create flexible and scalable environments for our applications. Traefik's integration with Docker makes it easy to define route rules, load balance traffic, and add SSL, giving us full control over the networking of our containerized applications.

Harnessing the Power of C++: Journey into New Territory

Gyasi Sutton, MD, MPH — Mon, 29 May 2023 15:39:19 GMT

Akin to the diverse specialties in medicine, the programming world is vast, each language offering unique perspectives and tools to solve problems. Despite having a strong grasp on Python, Rust, Node.js, and R, I often found myself seeking something more, akin to a physician needing a specialized diagnostic tool to identify and treat a unique disease. I needed a language that could offer me raw power, control, and a direct interface with system architecture. After careful introspection, I realized that my next foray into the programming landscape would be C++.

C++ - the language I had admired from afar - was now in the spotlight of my programming journey. Although my knowledge of Python, Rust, Node.js, and R served me well, I found myself constantly needing a plugin that could do something specific for a program I hadn't yet created. I yearned for the capability to extend functionality, to mold and shape software in ways that went beyond the boundaries of my existing skill set.

Just like a physician encountering a patient presenting unique symptoms might think of a rare but specific medical condition, such as Barth syndrome, I started to recognize that my programming needs were pointing me towards C++. The unique attributes of the language - raw power and control, the ability to manage resources and memory, and direct system level access - were all clear indications that C++ could be the panacea for my plugin development needs.

Moreover, my personal interest in computer vision projects drew me closer to C++. Libraries such as OpenCV have native support in C++, making it a go-to language for real-time computer vision, which relies heavily on speed and efficiency. Not only does C++ offer a wide range of libraries for various applications, but it also provides the high performance needed for computationally intensive tasks. It felt like the missing piece of the puzzle, the highly specialized tool that would augment my programming arsenal.

Now, let's dive deeper into the world of C++ and understand its advantages, how to set up a C++ environment on Linux using Conda, and how to write plugins and create our first 'Hello, World!' program in C++.

Advantages of C++

If the world of programming languages was a hospital, C++ would be the highly respected and versatile surgeon, known for its performance, flexibility, and efficiency.

Like a medical professional needing to understand human anatomy in depth, mastering C++ requires an intimate knowledge of the underlying system architecture. However, this level of understanding pays off in the form of more robust and efficient programs.

High Performance and Efficiency

Much like a rare but specific medical condition, Barth syndrome, which presents with cardiolipin deficiency, neutropenia, and muscle weakness, C++ has unique attributes that set it apart from other languages. In C++, the idiosyncratic signs of raw power and control, ability to manage resources and memory, and direct system level access are all clear indications of the language's high performance and efficiency.

Flexibility and Versatility

C++ is a multi-paradigm language. This means that it supports different styles of programming including procedural, object-oriented, and generic, offering programmers a wide array of tools to solve problems. It's like a diagnostician capable of recognizing and treating a wide variety of diseases.

Writing Plugins with C++

Just as medical researchers often develop therapies to tackle specific diseases, C++ programmers can develop plugins to extend the functionality of existing software.

Plugins are dynamic libraries loaded by an application at runtime. Writing them in C++ can be particularly useful due to the language's efficiency and control. This is similar to administering targeted therapies in medical treatment, which work by targeting specific genes or proteins involved in the growth and survival of cancer cells, resulting in improved treatment outcomes.

Requirements for a Linux, Conda Environment

A Conda environment can be thought of as a private room in a hospital. It isolates a specific space for each patient (or software project in our case) to avoid cross-contamination (interference between different software versions or dependencies).

To set up a C++ environment on Linux using Conda, you'll need the following:

A Linux operating system. We'll assume a Ubuntu-based distribution for this guide.
Anaconda or Miniconda installed on your system. You can download it from their official website.
Basic familiarity with the terminal commands.

Once these prerequisites are in place, setting up the C++ environment is as simple as creating a new Conda environment, and installing the necessary C++ libraries.

Writing, Compiling, and Running a C++ Program

Let's write a simple 'Hello, World!' program in C++. This quintessential program is an excellent way to test our development environment.

First, we'll create the program. Open a text editor, type the following code, and then save the file as hello_world.cpp:

#include

int main() {
    std::cout << "Hello, World!" << std::endl;
    return 0;
}

In this program, #include includes the input-output stream header file (iostream) into our program. The main function is the entry point of our program. std::cout prints the string "Hello, World!" to the console.

Once we have written the program, we'll need to compile it. In the terminal, navigate to the directory containing hello_world.cpp. Compile the C++ file using the g++ compiler:

g++ hello_world.cpp -o hello_world

The -o option lets us specify the output file name. In this case, the compiled program will be named hello_world.Upon successful compilation, a new executable file named hello_world will be created in the same directory.

Next, we'll run the program. In the terminal, type the following command:

./hello_world

This command will execute the hello_world program, which will print "Hello, World!" to the console.

That's it! We've written, compiled, and run our first C++ program.

Remember, you need the g++ compiler installed on your system to compile C++ code. If you're using the Linux Conda environment setup we discussed earlier, the g++ compiler should already be available. If not, you can install it with the package manager of your Linux distribution. For instance, on Ubuntu, use the command sudo apt install g++.

This basic 'Hello, World!' program is just the beginning. With the power and flexibility of C++, the possibilities for creating complex applications, plugins, and more are virtually endless. It's an investment that provides the opportunity to create more efficient, powerful, and flexible solutions. I'm eager to harness the full potential of this robust language, utilizing its native libraries for my computer vision projects and beyond. Stay tuned for more about this powerful, albeit difficult(in my opinion) language. Happy Coding!

Mastering Classes in Python with a Medical Twist: A Comprehensive Tutorial

Gyasi Sutton, MD, MPH — Tue, 16 May 2023 01:04:18 GMT

Introduction:

Python, with its simplicity and readability, is a versatile programming language that has found its way into diverse fields, including medical science. One of its key features is the ability to create classes, encapsulating data and functions into an organized and reusable unit. This tutorial delves into Python classes, their syntax, attributes, the essential concept of "self", and how they can be applied in medical scenarios, particularly pharmacology and pathology.

Understanding the Basics of Classes:

Classes, in Python, are blueprints for creating objects (instances), each equipped with a predefined set of attributes and methods. They form the backbone of object-oriented programming in Python, providing a structure to package data and functions together.

Consider a class as a template for a real-world entity. For instance, if we are creating a pharmacology application, a 'Drug' class could represent the real-world concept of a drug. This class would have attributes (characteristics of the drug) and methods (actions that can be performed with the drug).

Defining a Class:

To define a class in Python, we use the 'class' keyword followed by the class name. Here's an example, creating a 'Drug' class:

class Drug:
    pass

This class currently does nothing, but we have laid the foundation for our 'Drug' class.

Adding Attributes:

Attributes represent the properties or characteristics associated with a class. They store essential data and represent the state of an object. For our 'Drug' class, we can add attributes like 'name', 'dosage', and 'side_effects':

class Drug:
    def __init__(self, name, dosage, side_effects):
        self.name = name
        self.dosage = dosage
        self.side_effects = side_effects

The init method, a constructor, is executed when an object is created from the class. Inside this method, we initialize the attributes using the 'self' keyword.

Understanding "self":

The 'self' keyword in Python is a convention that references the instance of the class. It allows access to the attributes and methods within the class. When defining a method inside a class, 'self' must be included as the first parameter, even though it's not explicitly passed when calling the method.

Let's explain with our 'Drug' class. When we create a Drug object (let's say Paracetamol), 'self' refers to this object, enabling us to set and access its attributes.

Accessing Attributes:

Having defined the attributes in our 'Drug' class, we can access them using dot notation. Here's how we create an instance of the 'Drug' class and print the attribute values:

paracetamol = Drug("Paracetamol", "500mg", ["Nausea", "Rashes"])
print(paracetamol.name)  # Output: Paracetamol
print(paracetamol.dosage)  # Output: 500mg
print(paracetamol.side_effects)  # Output: ['Nausea', 'Rashes']

Here, 'paracetamol' is an instance (object) of the 'Drug' class, and we access its attributes using dot notation.

Adding Methods:

Methods define the behavior of the objects created from the class. They are essentially functions defined within a class. For our 'Drug' class, we can add a method 'display_info' that prints the drug's information:

class Drug:

    def __init__(self, name, dosage, side_effects):
        self.name = name
        self.dosage = dosage
        self.side_effects = side_effects
        

    def display_info(self):
        print(f"Drug Name: {self.name}")
        print(f"Dosage: {self.dosage}")
        print(f"Side Effects: {self.side_effects}")

Notice that we have used 'self' in our method definition to access the attributes. Now, we can call this method on an instance of the 'Drug' class:

paracetamol = Drug("Paracetamol", "500mg", ["Nausea", "Rashes"])
paracetamol.display_info()

This will output:

Drug Name: Paracetamol
Dosage: 500mg
Side Effects: ['Nausea', 'Rashes']

Inheritance and Superclasses:

Inheritance is a fundamental concept in object-oriented programming that allows one class (subclass) to inherit attributes and methods from another class (superclass). This promotes code reusability and a logical structure.

Consider a situation where we have a specific drug, such as an antibiotic. This 'Antibiotic' class could inherit from the 'Drug' class:

class Antibiotic(Drug):
    def __init__(self, name, dosage, side_effects, antibiotic_class):
        super().__init__(name, dosage, side_effects)
        self.antibiotic_class = antibiotic_class

The 'super()' function calls the constructor of the superclass, allowing us to access its methods and attributes. In this case, we use it to initialize the 'name', 'dosage', and 'side_effects' attributes from the 'Drug' class. We also add an extra attribute, 'antibiotic_class'.

amoxicillin = Antibiotic("Amoxicillin", "500mg", ["Nausea", "Rashes"], "Penicillin")
amoxicillin.display_info()
print(f"Antibiotic Class: {amoxicillin.antibiotic_class}")  # Output: Penicillin

In this example, 'amoxicillin' is an instance of the 'Antibiotic' class, which is a subclass of the 'Drug' class. It inherits the attributes and methods of the 'Drug' class and adds one more attribute specific to antibiotics.

Disease Process: A Practical Example

Python classes can be used effectively to model complex real-world processes, such as the progression of diseases. Let's delve into a comprehensive example of a disease process: the progression of diabetes.

In this case, we will define a 'Patient' class, a 'Disease' class, and a 'Diabetes' subclass. A patient will have attributes such as 'name', 'age', 'weight', and 'blood_sugar_level', and the disease class will have 'name', 'symptoms', and 'treatment'. Diabetes, as a subclass of Disease, will additionally have 'type' (Type 1 or Type 2) as an attribute.

Defining the Patient Class:

class Patient:
    def __init__(self, name, age, weight, blood_sugar_level):
        self.name = name
        self.age = age
        self.weight = weight
        self.blood_sugar_level = blood_sugar_level

Defining the Disease Class:

class Disease:
    def __init__(self, name, symptoms, treatment):
        self.name = name
        self.symptoms = symptoms
        self.treatment = treatment

Defining the Diabetes Subclass:

class Diabetes(Disease):
    def __init__(self, name, symptoms, treatment, type):
        super().__init__(name, symptoms, treatment)
        self.type = type

In this model, each 'Patient' object could have a 'Disease' object as an attribute, representing the disease the patient is suffering from.

class Patient:
    def __init__(self, name, age, weight, blood_sugar_level, disease=None):
        self.name = name
        self.age = age
        self.weight = weight
        self.blood_sugar_level = blood_sugar_level
        self.disease = disease

In this case, 'disease' is an optional attribute. If a patient is diagnosed with diabetes, this attribute can be updated to an instance of the 'Diabetes' class.

Now, we can create a patient diagnosed with Type 2 Diabetes:

diabetes_type2 = Diabetes("Diabetes", ["Increased thirst", "Frequent urination"], "Insulin Therapy", "Type 2")
patient_john = Patient("John Doe", 50, 90, 180, diabetes_type2)

This example shows how Python classes can be used to model intricate real-world processes effectively, enhancing understanding and promoting better data organization.

Medical Assessment and Treatment: A Class-Based Approach

The process of assessing and treating a disease often follows a medical algorithm or decision tree, guiding healthcare professionals in their clinical decision-making. In western medicine, this approach often includes assessment, diagnosis, treatment, and follow-up.

We can create Python classes to model this process. Let's create a class 'MedicalAlgorithm' that encompasses these steps. To demonstrate, we'll continue using diabetes as our example disease.

Defining the MedicalAlgorithm Class:

class MedicalAlgorithm:
    def __init__(self, patient, disease):
        self.patient = patient
        self.disease = disease

    def assess(self):
        print(f"Assessing {self.patient.name}'s condition...")
        print(f"Blood Sugar Level: {self.patient.blood_sugar_level}")
        print(f"Symptoms: {self.disease.symptoms}")

    def diagnose(self):
        if self.patient.blood_sugar_level > 125 and "Increased thirst" in self.disease.symptoms:
            print(f"Diagnosis: {self.disease.name} - {self.disease.type}")
        else:
            print("Further tests required for definitive diagnosis.")

    def treat(self):
        print(f"Initiating treatment for {self.patient.name}...")
        print(f"Prescribed Treatment: {self.disease.treatment}")

    def follow_up(self):
        print(f"Scheduled follow-up for {self.patient.name} in 4 weeks.")

This class takes a 'Patient' object and a 'Disease' object as input. It has four methods:

1. assess(): This method prints the patient's symptoms and current blood sugar level.

2. diagnose(): This method uses a simple algorithm to diagnose the disease based on the patient's blood sugar level and symptoms. If the blood sugar level is above a certain threshold and the patient exhibits particular symptoms, a diagnosis is made.

3. treat(): This method initiates treatment for the patient based on the 'Disease' object.

4. follow_up(): This method schedules a follow-up appointment for the patient.

Now, we can apply this medical algorithm to our patient diagnosed with Type 2 Diabetes:

medical_algorithm = MedicalAlgorithm(patient_john, diabetes_type2)
medical_algorithm.assess()
medical_algorithm.diagnose()
medical_algorithm.treat()
medical_algorithm.follow_up()

This will output:

Assessing John Doe's condition...
Blood Sugar Level: 180
Symptoms: ['Increased thirst', 'Frequent urination']
Diagnosis: Diabetes - Type 2
Initiating treatment for John Doe...
Prescribed Treatment: Insulin Therapy
Scheduled follow-up for John Doe in 4 weeks.

With this approach, we can effectively encapsulate the steps of a medical algorithm into a Python class. This example provides a simplified illustration; in reality, medical algorithms can be more complex and nuanced. However, this demonstration shows how Python classes can be used to model and organize such processes effectively.

Conclusion:

Python classes, with their encapsulation of data and functions, play a vital role in structuring code for better organization and reusability. They're integral to object-oriented programming in Python, enabling you to define custom data types that match real-world entities.

The concept of 'self' is crucial in defining and working with classes, as it provides access to an instance's attributes and methods within the class. By integrating the use of classes in a medical context, we've shown how Python can be applied in diverse fields.

Keep practicing and experimenting with Python classes to solidify your understanding and explore their full potential. Mastering classes is a significant step towards becoming proficient in Python, opening the doors to more complex and powerful programming. Happy coding!

Stash Your Way to a Better Git Workflow

Gyasi Sutton, MD, MPH — Sun, 05 Mar 2023 17:57:50 GMT

Git stash is a command in Git that allows you to temporarily save changes that you've made to your working directory without committing them. Stashing is useful when you need to switch to another branch or work on something else temporarily but want to come back to the changes you were working on later.

Here are the main git stash commands you need to know:

git stash save "message":

This command saves your changes to the stash. You can add a message to describe the changes you're saving. The syntax is:

git stash save "message"

git stash list:

This command lists all the stashes that you've saved. The syntax is:

git stash list

git stash apply:

This command applies the most recent stash to your working directory. The syntax is:

git stash apply

If you want to apply a specific stash, you can use the syntax:

git stash apply stash@{n}

where n is the index number of the stash you want to apply.

git stash pop:

This command applies the most recent stash to your working directory and removes it from the stash list. Essentially, it's like applying a stash with git stash apply and then immediately removing it with git stash drop. This can be useful when you're confident that you no longer need the saved changes in the stash, or if you want to apply a stash and clean up your stash list in one step. However, it's important to use git stash pop with caution, as once you pop a stash it cannot be easily The syntax is:

git stash pop

If you want to pop a specific stash, you can use the syntax:

git stash pop stash@{n}

git stash drop:

This command removes a specific stash from the stash list. The syntax is:

git stash drop stash@{n}

If you don't specify a stash, it will remove the most recent stash.

git stash clear:

This command removes all stashes from the stash list. The syntax is:

git stash clear

git stash branch:

This command creates a new branch and applies the most recent stash to it. The syntax is:

git stash branch branchname

where branchname is the name of the new branch.

In summary, Git stash is a powerful tool that can help you save changes temporarily and improve your workflow when working with Git. Whether you need to switch to another branch, work on something else temporarily, or simply clean up your commit history, Git stash has you covered. By mastering the commands outlined in this tutorial, you'll be able to use Git stash with confidence and take your Git skills to the next level. If you have any questions or feedback, feel free to leave a comment below!

Command	Description
`git stash save "message"`	Saves changes to the stash with a message
`git stash list`	Lists all stashes
`git stash apply`	Applies the most recent stash
`git stash apply stash@{n}`	Applies the stash at index `n`
`git stash pop`	Applies and removes the most recent stash
`git stash pop stash@{n}`	Applies and removes the stash at index `n`
`git stash drop stash@{n}`	Removes the stash at index `n`
`git stash drop`	Removes the most recent stash
`git stash clear`	Removes all stashes
`git stash branch branchname`	Creates a new branch with the most recent stash applied

The Art of Docker Cleanup: How to Feng Shui Your Environment and Keep Your Containers Running Smoothly

Gyasi Sutton, MD, MPH — Sat, 18 Feb 2023 10:11:00 GMT

If you're constantly creating and destroying Docker environments, like me, you may find that your host machine's disk space fills up quickly. This can lead to unexpected errors, lost files, and reduced performance. Managing Docker environments can be a challenging task, especially when it comes to keeping your system clean and organized. This tutorial will cover some tips and best practices for reclaiming disk space in your Docker environment, including removing unused containers, images, networks, and volumes. By following these techniques, you can keep your system running smoothly and avoid running out of disk space.

Docker provides a number of tools for managing disk space, including removing unused objects, running the garbage collector, and using filters to limit the objects that will be removed. But lets get the basics out of the way first:

Docker Objects

There are three types of objects in Docker: images, containers, and volumes.

Images: An image is a read-only template that contains the application code, libraries, and dependencies required to run the application.
Containers: A container is a running instance of an image. When a container is created, it is based on an image and contains all the necessary files and libraries to run the application.
Volumes: A volume is a persistent data storage mechanism that can be shared between containers. Volumes are used to store data that needs to persist even when the container is removed.

Removing Unused Objects

To remove unused objects in Docker, you can use the docker system prune command. This command removes all unused containers, images, networks, and volumes from your system.

docker system prune

This command will ask for confirmation before removing the objects. If you want to remove all unused objects without confirmation, you can use the --force or -f option:

$ docker system prune --force

Docker Prune Commands:

Function	Description
docker system prune	Remove unused containers, images, networks, and volumes
docker system prune -a	Remove all unused objects, including containers, images, networks, and volumes
docker system prune -f	Force removal of all unused objects, without confirmation
docker system prune --volumes	Remove unused volumes
docker system prune --filter "label="	Remove objects matching a specific label
docker system prune --filter "until="	Remove objects that have not been used in a certain duration
docker system prune --filter "dangling=true"	Remove objects that are not associated with any container
docker system prune --filter "networks="	Remove unused networks
docker container prune	Remove all stopped containers
docker container prune --filter "until="	Remove containers that have not been used in a certain duration
docker image prune	Remove all unused images
docker image prune --filter "dangling=true"	Remove all dangling images
docker image prune --filter "label="	Remove images with a specific label
docker image prune --filter "dangling=false" --filter "label!="	Remove images that are not associated with a specific repository
docker volume prune	Remove all unused volumes
docker volume prune --filter "dangling=true"	Remove all dangling volumes
docker volume prune --filter "label="	Remove volumes with a specific label
docker volume prune --filter "dangling=false" --filter "label!="	Remove volumes that are not associated with a specific container

What if I delete an image?!

Images are necessary for creating and running containers in Docker. When a container is created, it is based on an image that contains the application code, libraries, and dependencies required to run the application.

If you delete an image that is being used by a running container, the container will continue to function normally. This is because the container has already loaded the necessary files and libraries into memory, and does not rely on the image file once it is running. However, if you delete an image that is required to create a container, you will not be able to create new containers from that image since it's no longer on your system but, you can pull it again from a registry or rebuild it from a Dockerfile.

To pull an image from a registry, you can use the docker pull command followed by the name of the image and its tag (if applicable). For example:

$ docker pull nginx:latest

This will download the latest version of the Nginx image from Docker Hub.

To rebuild an image from a Dockerfile, you can use the docker build command followed by the path to the Dockerfile. For example:

$ docker build -t myapp:latest .

This will build a new image called myapp from the Dockerfile in the current directory.

Once you have the image, you can use it to create new containers as needed. For example:

$ docker run -d myapp:latest

This will create a new container from the myapp image and run it in the background.

It's a good practice to avoid deleting images that are required to create containers. Instead, you can use the docker image prune command to remove unused images that are not associated with any containers or tags. This will help you to free up disk space without affecting your running containers.

Using Filters

To limit the objects that will be removed, you can use filters with the docker system prune command. Here are some examples of filters you can use:

Dangling: Filter images that are not associated with any container or tag.

$ docker image prune --filter "dangling=true"

The dangling filter option is used to filter images that are not referenced by any container or tag. When an image is created, it is given a unique ID that is used to identify it. If the image is tagged with a name and pushed to a registry or used to create a container, it is no longer considered dangling. However, if the image is not tagged or used to create a container, it is considered dangling and can be removed with the docker image prune command.

Until: Filter objects that have not been used in a certain duration.

$ docker system prune --filter "until=24h"

Label: Filter objects that have a specific label.

$ docker volume prune --filter "label=myapp"

Warning

It's important to note that removing unused objects and running the garbage collector can have unintended consequences. Be sure to review the objects that will be removed before running any of the above commands. Also, removing an image or container will permanently delete it, so be sure to back up any data that you want to keep.

Docker Compose

Docker Compose is a tool for defining and running multi-container Docker applications. It provides a convenient way to manage containers, networks, and volumes, and automates the creation and management of these objects.

You can use the docker-compose down command to remove all containers, networks, and volumes created by docker-compose up. This command will stop and remove all containers that were created with docker-compose up, as well as any associated networks and volumes.

Here's an example docker-compose.yml file that defines a web application with a PostgreSQL database:

version: '3'
services:
  web:
    image: nginx:latest
    ports:
      - "80:80"
  db:
    image: postgres:latest
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: mypassword
      POSTGRES_DB: mydb
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:

This file defines two services, web and db, and creates a named volume called db-data for storing the data of the db service.

To create and start the containers defined in this file, you can run the following command:

$ docker-compose up -d

This command will start the containers defined in the docker-compose.yml file in detached mode (-d), which means that they will run in the background. To stop and remove the containers created by docker-compose up, you can use the docker-compose down command:

$ docker-compose down

This command will stop and remove all containers created by docker-compose up, as well as any associated networks and volumes. By default, the docker-compose down command will remove only the containers and networks created by docker-compose up. To also remove any volumes that were created by docker-compose up, you can use the --volumes or -v option:

$ docker-compose down --volumes

This command will stop and remove all containers created by docker-compose up, as well as any associated networks and volumes, including named volumes.

Note that the docker-compose down command will permanently delete all data stored in the volumes associated with the containers that are being removed. Make sure to back up any data that you want to keep before running this command.

Using Docker Compose with persistent volumes is a good way to ensure that your data is safe and secure even when the containers are removed. By defining named volumes in your docker-compose.yml file, you can store data in a volume that will persist even when the container is removed. This is a good practice for applications that require persistent data storage.

In conclusion, managing disk space in Docker is an important aspect of Docker administration. By using the tools provided by Docker, such as the docker system prune command and filters, you can remove unused objects and reclaim disk space. Additionally, using Docker Compose with persistent volumes is a good way to ensure that your data is safe and secure even when the containers are removed.

Summary Commands

Command	Description
docker system prune	Remove all unused containers, images, networks, and volumes
docker image prune	Remove all unused images
docker container prune	Remove all stopped containers
docker volume prune	Remove all unused volumes
docker system prune --filter	Use filters to limit the objects that will be removed
docker image rm	Remove a specific image
docker-compose up	Start the containers defined in a Docker Compose file
docker-compose down	Stop and remove the containers, networks, and volumes created by Docker Compose

Demystifying User Permissions and Access in Linux for Developers

Gyasi Sutton, MD, MPH — Sun, 12 Feb 2023 17:18:47 GMT

In Linux, permissions and access control are fundamental concepts that govern the security of a system. The users of a Linux system can have different levels of access, depending on their permissions. The root user, also known as the superuser, has unrestricted access to the entire system, while other users have limited access to specific parts of the system.

In this tutorial, we will discuss the concepts of root and user permissions, access, and privileges in Linux.

User Accounts

In Linux, each user has a unique username and user ID (UID). Usernames are used to identify users in the system, while UID is a numeric identifier that is used by the system to determine the user's access privileges.

Each user has their own home directory, where they can store their personal files and data. By default, users can only access files and directories that are owned by them, unless they have been granted special permissions.

Root User

The root user is a special user account that has unrestricted access to the entire system. The root user has the highest level of permissions in Linux, and can perform any system-level task, such as installing software, modifying system configuration files, and managing system processes.

However, with great power comes great responsibility. The root user has the ability to make changes that could potentially harm the system, so it should be used with caution. It is generally recommended to use the root user only when it is absolutely necessary, and to perform routine tasks using a regular user account.

File Permissions

In Linux, file permissions are used to control access to files and directories. File permissions are divided into three categories: owner, group, and others.

The owner of a file is the user who created the file, and has full access to the file. The group is a collection of users who share the same access permissions to a file, while others are users who are not the owner or in the group.

Each category has three types of permissions: read (r), write (w), and execute (x). The read permission allows a user to view the contents of a file, the write permission allows a user to modify the contents of a file, and the execute permission allows a user to run the file as a program.

To view the permissions of a file, use the ls command with the -l option:

ls -l filename

The output will show the permissions of the file in the following format:

-rw-r--r-- 1 username groupname size date filename

The first character in the output (-) represents the type of file. The next three characters (rw-) represent the permissions of the owner, the next three (r--) represent the permissions of the group, and the final three (r--) represent the permissions of others.

Changing File Permissions

To change the file permissions, you can use the chmod command. The chmod command changes the permissions of a file or directory.

The syntax of the chmod command is as follows:

chmod [options] mode file

The options are:

-r : recursively change permissions of all files and subdirectories within the specified directory
-v : verbose mode, displays the permissions of each file after it has been changed
-c : changes only those files whose permissions have actually been changed

The mode is a three-digit number that represents the new permissions of the file. The first digit represents the owner's permissions, the second digit represents the group's permissions, and the third digit represents the permissions of others. Each digit is calculated by adding the values of the read (4), write (2), and execute (1) permissions.

For example, to give the owner full permissions, the group read and execute permissions, and others no permissions, you can use the following command:

chmod 750 filename

The 7 represents the sum of the read, write, and execute permissions for the owner (4+2+1=7), the 5 represents the sum of the read and execute permissions for the group (4+1=5), and the 0 represents no permissions for others. Alternatively, you can use the letters r, w, and x to represent the permissions. For example, to give the owner full permissions, the group read and execute permissions, and others no permissions, you can use the following command:

chmod u=rwx,g=rx,o= filename

The u stands for owner, g stands forgroup, and o stands for others. The rwx stands for read, write, andexecute, while the rx stands for read and execute.

UserGroups

In Linux, user groups are used to assign permissions to multiple users at once. A user can be a member of multiple groups, and each group can have its own setof permissions. To create a new group, you can use the groupadd command

groupadd groupname

To add a user to a group, you can use the usermod command:

usermod -a -G groupname username

The -a option adds the user to the group, while the -G option specifies the group.

Sudo

In Linux, the sudo command allows a user to execute commands with the permissions of the root user. This can be useful for performing tasks that require root privileges, without actually logging in as the root user.

To use the sudo command, simply prefix the command with sudo:

sudo command

You will be prompted for your password, and if it is correct, the command will be executed with root privileges.

Conclusion

The concepts of user permissions, access, and privileges are important not only on a single Linux workstation, but also on different types of servers, and even in Docker containers. In a multi-user environment, where multiple users are accessing the same system or application, it is important to restrict access to certain resources and data to prevent unauthorized modifications or breaches. Similarly, in server environments, managing user access and permissions is crucial for controlling who can access the server and what actions they can perform. With the increasing popularity of containerization technologies like Docker, managing user permissions and access is also important in containerized environments. In Docker, users can be assigned different permissions within the container, and the host system can restrict access to certain system resources, providing an added layer of security. By understanding user permissions and access in Linux, developers can ensure the safety and stability of their applications across different types of systems and environments.

Command	Option	Output
whoami		Displays the current user's username
id		Displays the current user's UID, GID, and group membership
su		Allows the current user to switch to the root user
su	- username	Allows the current user to switch to the specified user
sudo		Executes a command with root-level privileges
sudo	-u username	Executes a command with the privileges of the specified user
adduser	username	Creates a new user with the specified username
usermod	-a -G groupname username	Adds the user to the specified group
groupadd	groupname	Creates a new group with the specified name
chown	username:groupname filename	Changes the owner and group of the specified file
chmod	mode filename	Changes the permissions of the specified file

Mastering Time: A Beginner's Guide to Date and Time Manipulation in Python

Gyasi Sutton, MD, MPH — Fri, 10 Feb 2023 16:10:29 GMT

Sooner or later, during your coding journey, you are going to come across the need to manipulate time and dates. Especially if you deal with medical records as every entry, assessment, treatment, etc, should have a time and date associated with it. Different environment and languages have their own specific quirks(I'm looking at you SAS) when it comes to handling these aspects, and in this article, we are going to talk about python's modules for handling such objects. Python has a built-in module called datetime that provides the necessary classes and functions to work with dates and times. However, there are other third-party packages like dateutil, pytz, and pendulum that can make working with dates and times even easier. We will start with the datetime module first.

The datetime module provides the datetime class, which can be used to represent a specific point in time. The datetime class takes three arguments: the year, the month, and the day. For example:

from datetime import datetime
now = datetime(2022, 12, 1)
print(now)
# Output: 2022-12-01 00:00:00

The datetime module also provides the date class, which can be used to represent a specific date without a time. The date class takes three arguments: the year, the month, and the day. For example:

from datetime import date
today = date.today()
print(today)
# Output: 2022-12-01

You can also use the datetime.now() function to get the current date and time, and the datetime.utcnow() function to get the current date and time in UTC.

The datetime module also provides the timedelta class, which can be used to represent a duration of time. The timedelta class takes several arguments, such as days, seconds, microseconds, milliseconds, minutes, hours, and weeks.

For example, to get the date and time one day from now:

from datetime import datetime, timedelta
now = datetime.now()
tomorrow = now + timedelta(days=1)
print(tomorrow)
# Output: 2022-12-02 10:15:30.283740

You can also subtract timedelta from datetime to get the date and time before.

To get rid of the time component in a datetime object, you can use the date() method, which returns a date object with the same year, month, and day as the datetime object.

For example:

from datetime import datetime
now = datetime.now()
print(now)
# Output: 2022-12-01 10:15:30.283740
date_only = now.date()
print(date_only)
# Output: 2022-12-01

To output a date in a human-readable format, you can use the strftime() method, which allows you to format the date and time using codes that represent the various components of the date and time.

For example, to output a date in the format "YYYY-MM-DD":

now = datetime.now()
print(now.strftime("%Y-%m-%d"))
# Output: 2022-12-01

But as mentioned before python's datetime module has its limitations, so developers often use third party packages like dateutil , pytz, pendulum to work with dates and times.

The dateutil package provides a parser.parse() method that can parse almost any string representation of date and time, and also includes a relativedelta() function that can be used to perform arithmetic operations with dates and times.

For example, to format a date in the format "YYYY-MM-DD" using the dateutil package:

from dateutil import parser
now = parser.parse("2022-12-01")
print(now.strftime("%Y-%m-%d"))
# Output: 2022-12-01

Another package for date and time manipulation is pytz, which provides the timezone information.

import pytz
now = datetime.now(pytz.UTC)

This will give you the date and time in UTC format.

pendulum is another package that provides simple, easy-to-use, and Pythonic methods for creating, manipulating, formatting, and parsing dates and times. It also includes timezone support and advanced features such as period arithmetic, recurrences, and humanization.

import pendulum
now = pendulum.now()
print(now.to_date_string())

This will give you the date in string format 'YYYY-MM-DD'

Datetime syntax table

Code	Explanation	Output (example)
%a	Abbreviated weekday name	Mon
%A	Full weekday name	Monday
%b	Abbreviated month name	Dec
%B	Full month name	December
%d	Day of the month (zero-padded)	01
%m	Month of the year (zero-padded)	12
%Y	Year with century	2022
%H	Hour (24-hour format)	00
%I	Hour (12-hour format)	12
%M	Minute	30
%S	Second	45

In conclusion, python's datetime module provides the necessary classes and functions to work with dates and times, but third-party packages like dateutil, pytz, and pendulum can make working with dates and times even easier. These packages provide additional features such as parsing and formatting dates, timezone support, and advanced features such as period arithmetic, recurrences, and humanization. It is recommended to choose the package that best fits your requirements and use the appropriate methods to format your date and time.

Of course, this is just the tip of the iceberg. For a deeper dive check out these links to libraries:

datetime: https://docs.python.org/3/library/datetime.html
dateutil: https://dateutil.readthedocs.io/en/stable/
pytz: https://pythonhosted.org/pytz/
pendulum: https://pendulum.eustace.io/docs/

-- Stay tuned for Part 2 of this time series for R

Crafting Effective Prompts for AI-Generated Text

Gyasi Sutton, MD, MPH — Mon, 06 Feb 2023 00:47:00 GMT

Prompt engineering is a critical aspect of natural language processing (NLP) and is central to the development of language models. By carefully designing prompts, we can influence the outputs generated by language models and guide them to generate text that is on-topic, coherent, and readable. Whether you are working on a conversational AI system, a text generation tool, or any other NLP project, having a solid understanding of prompt engineering is essential. In this tutorial, we will explore the various techniques and strategies used in prompt engineering, including the use of escape characters, structure keys, and readability keys. Whether you are a seasoned NLP practitioner or just getting started with language models, this tutorial will provide you with the knowledge and skills needed to create effective and meaningful prompts for your NLP projects.

Please note that the information and explanations provided in this tutorial are general guidelines for most NLP models and may work for some image models as well. However, it is important to always read the API documentation for the specific model you are using to understand the exact commands and syntax for prompts. The features and tools available for prompt engineering can vary greatly between different models and APIs, so it is essential to consult the documentation for the specific model you are using to ensure that you are using the correct syntax and commands.

Use these guidelines when generating a prompt:

1. Understanding the task: To get started with prompt engineering using ChatGPT, it's essential to have a clear understanding of the task you want the model to perform. For example, let's say you want the model to generate a weather report for a specific location and date.

2. Defining the prompt format: In this case, the prompt format could include a prompt header and a prompt body. The header could provide a high-level overview of the task, for example: "Generate a weather report for [location] on [date]." The prompt body could provide more specific details and constraints, such as the desired format of the report or any additional information to include.

3. Crafting the prompt: When writing the prompt, consider the following points:

a. Keep it concise: A clear, concise prompt is easier for the model to understand and generates more consistent outputs. For example:

Generate a weather report for New York City on February 3rd, 2023:

b. Use clear language: Use simple, straightforward language to minimize ambiguity and reduce the chance of the model producing unexpected results. For example:

Generate a weather report for New York City, USA on February 3rd, 2023:

c. Provide context: Provide enough context for the model to understand the task and the context in which it is being performed. For example:

Generate a brief weather report for New York City, USA on February 3rd, 2023 including current temperature and any precipitation.

d. Include examples: Including examples can help illustrate the desired output and provide a reference for the model. For example:

Generate a brief weather report for New York City, USA on February 3rd, 2023 including current temperature and any precipitation:

Example:
The current temperature in New York City, USA is 47°F with light rain.

4. Iterating on the prompt: Once you have a draft of the prompt, test it with the model to see how it performs. If you're not happy with the outputs, refine the prompt until you get the desired results. Repeat this process until you're satisfied with the results.

Iteration, an Important process

Iteration is a critical component of the prompt engineering process. Essentially, it involves refining and improving your prompt over time, until you achieve the desired results. The goal of iterating on the prompt is to optimize the language model's performance and generate outputs that are coherent, on-topic, and readable. To do this, you'll need to test your prompt with the model and observe the outputs it generates. Based on this, you can then make changes to the prompt and re-test it, until you get the results you want.

The iterative process can involve fine-tuning the wording, adjusting the structure, and adding or removing specific elements. For example, you might start by crafting a basic prompt that outlines the topic and provides some context. Then, you can iterate on the prompt by adding specific details, adjusting the tone, or incorporating relevant keywords. The more you iterate, the better you'll understand what works and what doesn't, and you'll be able to refine the prompt accordingly.

The process of iterating on the prompt can be time-consuming, but it's well worth the effort. By investing time and effort into the prompt engineering process, you can greatly improve the quality and consistency of the outputs generated by your language model, and ultimately deliver better results for your NLP project

Punctuation

It's important to consider the use of punctuation and special characters when creating prompts for language models like ChatGPT. Here are some tips:

Use clear and consistent punctuation: Using clear and consistent punctuation can help reduce ambiguity and improve the readability of your prompt. For example, always use a colon after the header to separate it from the body of the prompt.
Avoid complex punctuation: Complex punctuation such as multiple exclamation points or long strings of punctuation marks can be confusing and difficult for the model to parse. Stick to straightforward punctuation such as periods, commas, and colons.
Use special characters with caution: Some special characters, such as emojis or mathematical symbols, may not be supported by the model or may be interpreted differently than intended. If you need to include special characters, consider using plain text representations instead.

Example Prompts

A bullet point list of ingredients for a recipe:

Generate a recipe for spaghetti carbonara:

- spaghetti
- pancetta or bacon
- eggs
- parmesan cheese
- black pepper

2. A prompt for a math problem with parentheses and symbols:

Solve the following equation:

(2x + 3) * (x - 4) = 0

3. A prompt for statistical analysis with a table:

Perform a t-test on the following data to determine if there is a significant difference in the mean weight of two species of fish:

Species 1: 40g 45g 50g 55g 60g Species 2: 50g 55g 60g 65g 70g

4. A prompt for a short story or creative writing:

Write a short story about a character named Maria who discovers a mysterious object:

Maria was on a walk in the park when she stumbled upon a shiny object hidden in the grass. At first, she thought it was just a piece of trash, but as she picked it up, she realized it was a beautiful and intricate object unlike anything she had ever seen before.

5. A prompt for a historical event with dates and locations:

Write a brief summary of the events of the Battle of Gettysburg, fought July 1-3, 1863:

The Battle of Gettysburg was a decisive battle of the American Civil War, fought between the Union and Confederate forces in and around the town of Gettysburg, Pennsylvania. Over the course of three days, the Union army successfully repulsed repeated Confederate attacks, leading to a Union victory and a turning point in the war.

Coaxing the last bit of info

Escape characters are used to represent special characters or sequences that have a specific meaning in the input syntax and you can create more complex and readable prompts for your language model. Here's how you can add escape characters to a prompt:

Backslash (\): In most prompt inputs, the backslash is used as an escape character. To include a literal backslash in your prompt, you need to escape it by adding another backslash before it, like so: \\.
Newline (\n): To add a newline character to your prompt, you can use the escape sequence \n. This will cause the text that follows to be displayed on a new line.
Tab (\t): To add a tab character to your prompt, you can use the escape sequence \t. This will cause the text that follows to be indented.
Quotes (\' or \"): To include a quote character in your prompt, you can escape it by adding a backslash before it. For single quotes, use \', and for double quotes, use \".

Character	Description
\	Backslash (escape character)
\n	Newline
\t	Tab
\'	Single quote
\"	Double quote

In some prompt inputs, you may want to add themes or structures to guide the output and make it more coherent. Here's how you can do that:

Templates: One way to add structure to the output is to provide a template that the language model should follow. For example, if you are generating a recipe, you can provide a template with placeholders for ingredients, instructions, and serving information. The language model will then fill in the placeholders with the corresponding information.
Keywords: Another way to add themes to the output is to provide keywords or phrases that should appear in the output. This can help guide the language model to generate outputs that are on topic and coherent.
Grammar and syntax: You can also add structure to the output by specifying the grammar and syntax that should be used. For example, you can specify that the output should be written in a specific tense (e.g., present or past), or that it should follow a specific sentence structure (e.g., subject-verb-object).

Here's an example of a prompt with a template for a recipe:

Generate a recipe for a chocolate cake:

Ingredients:
- [x] cups of all-purpose flour
- [x] cups of granulated sugar
- [x] cups of unsweetened cocoa powder
- [x] teaspoons of baking powder
- [x] teaspoons of baking soda
- [x] teaspoons of salt
- [x] large eggs
- [x] cups of buttermilk
- [x] cups of warm water
- [x] cups of vegetable oil

Instructions:
1. Preheat the oven to [x]°F.
2. In a large bowl, whisk together the flour, sugar, cocoa powder, baking powder, baking soda, and salt.
3. In a separate bowl, beat the eggs, buttermilk, warm water, and vegetable oil.
4. Pour the wet ingredients into the dry ingredients and stir until just combined.
5. Pour the batter into a greased [x]-inch round cake pan.
6. Bake for [x] minutes, or until a toothpick inserted into the center of the cake comes out clean.
7. Let the cake cool for [x] minutes, then remove from the pan and transfer to a wire rack to cool completely.

Serving Information:
Serves [x] people.

By using templates, keywords, and syntax, you can help guide the language model to generate outputs that are more coherent and on topic, and that follow the structure and style that you desire.

Conclusion

In conclusion, prompt engineering is the process of designing and crafting effective prompts for language models to generate high-quality and coherent outputs. To achieve this goal, you can use techniques such as templates, keywords, and grammar/syntax to add structure and themes to the output.

In this tutorial, we covered the basics of prompt engineering and how to design effective prompts for language models. We discussed the importance of having a clear goal for the output, and how to use templates, keywords, and grammar/syntax to control the output. We also summarized the key concepts in a summary table and cheat sheet for quick reference.

Overall, prompt engineering is an important aspect of working with language models, as it allows you to control the quality and coherence of the outputs and get the results you need for your applications. Whether you are working on a specific project or simply exploring the capabilities of language models, prompt engineering is a valuable tool to have in your toolkit.

Summary

Here's a cheat sheet for prompt engineering:

Start by defining the goal of your prompt and what you want the output to be (e.g., a recipe, a story, a summary).
Consider using templates, keywords, and syntax to add structure and themes to the output.
Test and refine your prompt until you are satisfied with the quality and coherence of the outputs.

Concept	Description
Prompt engineering	The process of designing and crafting effective prompts for language models, in order to generate high-quality and coherent outputs.
Templates	A way to add structure to the output by providing a template that the language model should follow.
Keywords	A way to add themes to the output by providing keywords or phrases that should appear in the output.
Grammar and syntax	A way to add structure to the output by specifying the grammar and syntax that should be used.

Remember, the specific features and tools available for prompt engineering will depend on the specific language model you are using, so be sure to consult the documentation for more information.

Coding In Medicine

Agent-Sourced: A Provenance Tag for the Agent Era

The fight

The wrong question

Agent-sourced

Why a label beats a ban — and beats blind trust

Two tiers, and a gate that can’t be skipped

It honors both sides

Not another bot — an identifier and a verifier

The open question — and an invitation

Read more

My AI Engineering Philosophy: Why I Never Get Locked In

How I Learned This Lesson the Hard Way

The Trap Most Developers Fall Into

My Philosophy: Vendor-Neutral, Model-Agnostic Development

1. Models Change Faster Than You Think

2. The Wood and Paper Analogy

3. Vendors Change Their Terms

How This Philosophy Plays Out in My Work

Abstraction Layers Everywhere

Prompt Engineering That Travels

Universal Data Formats

Why Medium-Sized Businesses Must Train In-House

The Economics That Changed My Mind

The Real Cost of Vendor Lock-In

My Development Principles

The Bottom Line

Manipulating time like a TimeLord with Flux

A Bird’s-Eye View of Flux

Real-World Use Cases

Hello, Flux: Your First Tutorial

1. Installation

2. Meet Timeline

3. Sleeping Virtually

4. Changing the Time Factor

5. Freezing Time

6. Scheduling a Callback

Seven Practical Code Snippets

1. Testing a Long-Running Function

2. Speeding Up a Simulation

3. Scheduling Periodic Tasks

4. Freezing Time for Precise Assertions

5. Using the Global Timeline Across Modules

6. Simulating a 5-Year Stock Prediction Algorithm

7. Testing a Cron-Like Scheduler

A Data Engineer's Perspective: Timezones and Data Synchronization

Final Thoughts

Key Takeaways

SQL Alchemy for pythonic pipelines

What is SQL Alchemy

Advantages of SQL Alchemy

Differences Between SQLAlchemy and Straight SQL

Examples: SQLAlchemy vs. Straight SQL

Example 1: Fetching Data

Example 2: Inserting Data

Example 3: Relationships Between Tables

When to Use SQLAlchemy vs. Straight SQL

Two different paths to SQLAlchemy

1. Declaring Classes (SQLAlchemy ORM)

How It Works:

Advantages:

Example:

2. Using Core (Table-Based)

How It Works:

Advantages:

Example:

Key Differences Between Declaring Classes (ORM) and Core

When to Use Which?

Blending ORM and Core

Other Industry Standard Data Engines

1. Snowflake with SQLAlchemy

Setup

Connection Setup

2. General Approach for Other Databases

Steps:

Examples of Other Databases

MongoDB (with SQLAlchemy-mongo)

ClickHouse

Google BigQuery

Key Considerations for Non-Common Data Engines

2. Meet `Timeline`

Example 2: Bulk Data Insertion Using `text()`