Gyasi Sutton, MD, MPH. Physician in training, coder at heart. Python, R, Node, and Rust.

Creating a USMLE-style question-and-answer generator

We are going to start a project that uses current technology to vastly increase our knowledge retention. The goal is to teach ourselves, not ChatGPT, how to score high on a USMLE exam. We will start by defining a prompt and agent that outputs a single-best-answer MCQ. Once that works, we can move on to things like RAG and think about ways to tweak output counts and difficulty.

The United States Medical Licensing Examination (USMLE) is a three-step examination for medical licensure in the United States. It assesses a physician's ability to apply knowledge, concepts, and principles, and to demonstrate fundamental patient-centered skills, that are important in health and disease and constitute the basis of safe and effective patient care. The exam is sponsored by the Federation of State Medical Boards (FSMB) and the National Board of Medical Examiners (NBME).

Collect the Research

Now let's gather all the information we can about the layout and structure of the questions. Perplexity, Phind, or ChatGPT works well for this step.

Multiple-Choice Questions (MCQs)

Components of MCQs:

  1. Stem: The clinical vignette that provides context.
  2. Lead-in: The direct question posed to the examinee.
  3. Options: A set of possible answers, typically five choices (A to E), where one is the best or most correct answer.
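
To make these components concrete before we write any generation code, here is one way they might be modeled in Python (the field names are my own choices, not an official schema):

from dataclasses import dataclass


@dataclass
class MCQ:
    """A single-best-answer question, mirroring the components above."""

    stem: str            # clinical vignette providing context
    lead_in: str         # the direct question posed to the examinee
    options: list[str]   # typically five choices, A through E
    correct_index: int   # position of the single best answer among the options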

Distractors:

  • Distractors are incorrect options provided in multiple-choice questions designed to mislead or challenge the examinee. They are plausible enough to be considered but incorrect.
  • Purpose: They test the depth of the examinee’s knowledge and ability to discern the correct answer from similar but incorrect choices.

Difficulty Levels:

  • USMLE questions range from basic to highly complex. The difficulty is determined by factors such as the integration of multiple concepts, the specificity of the clinical scenario, and the level of critical thinking required.
  • Questions can be categorized into three levels:
    • Low Difficulty: Requires recall of basic facts and straightforward application of knowledge.
    • Moderate Difficulty: Involves understanding and applying multiple concepts to a clinical scenario.
    • High Difficulty: Requires synthesis of information, advanced clinical reasoning, and decision-making skills.

Measurement of Difficulty:

  • Item Response Theory (IRT): This statistical method is used to calibrate question difficulty and discriminate between different levels of examinee ability.
    • Parameters: Difficulty (b-parameter), discrimination (a-parameter), and guessing (c-parameter).
    • Calibration: Based on examinees' responses, the difficulty of each question is adjusted to ensure accurate measurement of ability (a small sketch of this model follows).
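
To make those three parameters concrete, here is a minimal sketch of the three-parameter logistic (3PL) model, the standard IRT form they come from (the example numbers are made up for illustration):

import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL model: probability that an examinee of ability theta answers correctly.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


# An average examinee (theta = 0) on a moderately hard item (b = 0.5)
# with typical discrimination (a = 1.2) and a five-option guessing floor (c = 0.2):
print(p_correct(theta=0.0, a=1.2, b=0.5, c=0.2))  # ≈ 0.48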

Example Deep Dive on Question Difficulty

Low Difficulty Question:

  • Stem: A 24-year-old female presents with a sore throat and fever. Physical examination reveals pharyngeal erythema and exudates.
  • Lead-in: What is the most likely diagnosis?
  • Options:
  • A) Streptococcal pharyngitis (correct)
  • B) Viral pharyngitis
  • C) Mononucleosis
  • D) Allergic rhinitis
  • E) GERD

Moderate Difficulty Question:

  • Stem: A 50-year-old male with a history of chronic obstructive pulmonary disease (COPD) presents with increased dyspnea and productive cough. His temperature is 38.3°C, and he has coarse breath sounds with wheezing.
  • Lead-in: What is the best initial treatment?
  • Options:
  • A) Antibiotics (correct)
  • B) Inhaled corticosteroids
  • C) Beta-blockers
  • D) Diuretics
  • E) Antihistamines

High Difficulty Question:

  • Stem: A 68-year-old female with a history of diabetes and hypertension presents with sudden onset of right-sided weakness and difficulty speaking. She was last seen normal 3 hours ago. Her blood pressure is 180/100 mmHg, and CT scan shows no hemorrhage.
  • Lead-in: What is the next best step in management?
  • Options:
  • A) IV thrombolytics (correct)
  • B) Aspirin
  • C) Clopidogrel
  • D) Heparin
  • E) Blood pressure control

All of this content took four minutes of prompting. I knew exactly what I was looking for and could verify most of the returned information. Now that we have all that info, we know we need certain elements in the output, namely the stem, lead-in, and options, but I also want to add some extra pieces. We also need the reasoning about why each answer is right or wrong, and for the right answer we want a deeper, more detailed explanation of why it is correct compared to the others. These will help guide the structure that we need. Thinking now about getting this output to conform to a JSON structure (sketched below) will speed up frontend development at a later point. We will save the difficulty problem for another tutorial.
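
Concretely, this is the rough shape I have in mind for each generated question (my own sketch at this point, not a finalized schema; the run later in this post conforms to it):

{
    "question": "stem paragraph plus lead-in",
    "choices": [
        {
            "choice": "an answer option",
            "explanation": "why this option is right or wrong",
            "reasoning": "deeper rationale, present only on the correct option",
            "correct_answer": true
        }
    ]
}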

Prompt Engineering

First let me plug two exciting technologies that I hope to get to in another tutorial: Fabric and DSPy. Fabric has made me an even bigger consumer of digital information, as I've used its prompts in pipelines to quickly summarize YouTube videos, research papers, and even Medium articles. DSPy is an interesting project where the prompt is abstracted away and that hidden prompt is fine-tuned to get better output. I'm underselling this a lot and will have a tutorial about it soon as well.
Now, in the interest of transparency, I came across an article that had a decent starting prompt template for what I wanted to do.

import os
from dotenv import load_dotenv
from guidance import models, gen


load_dotenv(
    ".env",
    override=True,
)


def generate_usmle_prompt(subject):
    # Prompt template adapted from the article mentioned above.
    initial_prompt = f"""
    You are developing a question bank for medical exams focusing on the topic of {subject}. The subject may be broad, so the first thing to think about is refining the question to a more specific subtopic, like a disease, medication, sign, symptom, marker, chemical process, etc., noting that it should always come back to a disease process. When creating the answers, make sure to create distractors that could be part of a differential diagnosis.

    Generate a high-quality Single Best Answer (SBA) question using the following framework:

    Create the question from the stem in parentheses below; while the stem is separated into parts to guide you, the introductory part of the question should be a paragraph composed of those stem parts.

    (Stem:
    Patient details (gender/age)
    Presenting complaint (PC)
    Relevant clinical history (HPC)
    Physical examination findings
    Diagnostic test results (optional)

    Lead-in question stem:
    )

    Answer options:
    [Insert plausible answer option]
    [Insert plausible answer option]
    [Insert plausible answer option]
    [Insert plausible answer option]
    [Insert plausible answer option]

    Explanation:
    For each answer option, provide a brief explanation of why it is correct or incorrect.

    Why this is the most appropriate answer:
    (this is an optional section to explain why the correct answer is the best choice in more detail)
    Given the clinical picture and the likely diagnosis of [insert condition], the most probable cause of the patient's symptoms is [insert correct answer option]."""

    # Route the request through LiteLLM to the gpt-4o model.
    llm = models.LiteLLMCompletion("gpt-4o", echo=False)

    # Append an open-ended generation to the prompt and capture it by name.
    llm += initial_prompt + gen(name="definition")

    question = llm["definition"]

    return question


def main():
    subject = input("Enter a medical subject: ")
    print("Generating initial prompt...")
    initial_prompt = generate_usmle_prompt(subject)

    print(initial_prompt)


if __name__ == "__main__":
    main()

If I run the program with the input "serious congenital virus", I get:

**Question:**

A 3-day-old male neonate presents with lethargy, poor feeding, and a fever of 38.5°C. The mother reports that the baby was born at term via vaginal delivery without complications. The mother had no significant prenatal history and tested negative for Group B Streptococcus. On physical examination, the neonate is jaundiced, has hepatosplenomegaly, and exhibits a petechial rash. A complete blood count reveals thrombocytopenia and elevated liver enzymes. 

What is the most likely diagnosis?

A) Cytomegalovirus (CMV) infection
B) Group B Streptococcus (GBS) sepsis
C) Neonatal herpes simplex virus (HSV) infection
D) Toxoplasmosis
E) Rubella

**Explanation:**

A) Cytomegalovirus (CMV) infection
- **Incorrect:** While CMV can cause jaundice, hepatosplenomegaly, and thrombocytopenia, it is less likely to present with a petechial rash and fever in the neonatal period.

B) Group B Streptococcus (GBS) sepsis
- **Incorrect:** GBS sepsis can present with fever and lethargy, but it is less likely to cause jaundice, hepatosplenomegaly, and a petechial rash.

C) Neonatal herpes simplex virus (HSV) infection
- **Correct:** HSV infection in neonates can present with fever, lethargy, poor feeding, jaundice, hepatosplenomegaly, and a petechial rash. The presence of these symptoms along with the elevated liver enzymes and thrombocytopenia makes HSV the most likely diagnosis.

D) Toxoplasmosis
- **Incorrect:** Toxoplasmosis can cause jaundice and hepatosplenomegaly, but it typically presents with other findings such as chorioretinitis and intracranial calcifications, which are not mentioned in this case.

E) Rubella
- **Incorrect:** Congenital rubella syndrome can cause jaundice and hepatosplenomegaly, but it is more commonly associated with cataracts, congenital heart defects, and sensorineural deafness, which are not described in this case.

**Why this is the most appropriate answer:**
Given the clinical picture and the likely diagnosis of neonatal herpes simplex virus (HSV) infection, the most probable cause of the patient's symptoms is HSV infection. The combination of fever, lethargy, poor feeding, jaundice, hepatosplenomegaly, petechial rash, thrombocytopenia, and elevated liver enzymes strongly suggests HSV as the underlying cause.

The output is decent, but not perfect yet. This is with the GPT-4o model; it would be interesting to see how other models compare to each other. We will test that in another session, but for now I'll just rewrite the code a little to get strict JSON output. This way I can properly save the output to a database, in case I want this particular generated question to pop up again.
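
The exact rewrite isn't shown here, but a minimal sketch looks something like the following: append a strict-JSON instruction to the existing prompt and parse the completion. (build_initial_prompt is a hypothetical helper that returns the same initial_prompt string used in generate_usmle_prompt above, rather than calling the model; error handling is omitted.)

import json

from guidance import models, gen

JSON_INSTRUCTION = """
Respond with ONLY a valid JSON object, no markdown fences or prose outside it, shaped as:
{"question": "...",
 "choices": [{"choice": "...",
              "explanation": "...",
              "reasoning": "...",
              "correct_answer": true}]}
Include "reasoning" and "correct_answer" only on the correct choice."""


def generate_usmle_question_json(subject):
    # Hypothetical helper: the same prompt template as generate_usmle_prompt
    # above, factored out so it returns the string instead of calling the model.
    initial_prompt = build_initial_prompt(subject)

    llm = models.LiteLLMCompletion("gpt-4o", echo=False)
    llm += initial_prompt + JSON_INSTRUCTION + gen(name="mcq")

    # json.loads raises if the model strays from the format, which doubles as
    # a cheap validation step before anything is written to a database.
    return json.loads(llm["mcq"])

Running the program again gives me this output: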

Enter a medical subject: diabetes
Generating initial prompt...
{
    "question": "A 45-year-old male presents to the clinic with excessive thirst and frequent urination. He reports a history of slowly progressive kidney failure. On physical examination, his blood pressure is 130/85 mmHg, and he appears well-hydrated. Laboratory tests reveal normal blood glucose levels, but a low urine osmolality. Which of the following is the most likely diagnosis?",
    "choices": [
        {
            "choice": "Diabetes Mellitus Type 1",
            "explanation": "Diabetes Mellitus Type 1 is characterized by hyperglycemia due to autoimmune destruction of insulin-producing beta cells. This patient has normal blood glucose levels, making this diagnosis unlikely."
        },
        {
            "choice": "Diabetes Mellitus Type 2",
            "explanation": "Diabetes Mellitus Type 2 involves insulin resistance and is also characterized by hyperglycemia. The patient's normal blood glucose levels do not support this diagnosis."
        },
        {
            "choice": "Diabetes Insipidus",
            "explanation": "Diabetes Insipidus is characterized by excessive thirst and urination due to a deficiency of antidiuretic hormone (ADH) or renal insensitivity to ADH, leading to dilute urine. The patient's symptoms and low urine osmolality are consistent with this condition.",
            "reasoning": "Given the clinical picture and the likely diagnosis of Diabetes Insipidus, the most probable cause of the patient's symptoms is a deficiency or insensitivity to antidiuretic hormone, leading to the production of large volumes of dilute urine.",
            "correct_answer": true
        },
        {
            "choice": "Chronic Kidney Disease",
            "explanation": "Chronic Kidney Disease can cause polyuria, but it is usually associated with other symptoms such as hypertension and electrolyte imbalances. The patient's normal blood pressure and low urine osmolality suggest a different diagnosis."
        },
        {
            "choice": "Primary Polydipsia",
            "explanation": "Primary Polydipsia involves excessive fluid intake leading to dilute urine. However, it is less likely in the context of slowly progressive kidney failure and the specific laboratory findings presented."
        }
    ]
}

This output has the structure I want, though I'm not necessarily satisfied with the length of the clinical vignettes; they should be longer, with more detail. We can explore expanding this through prompt engineering, which we will tackle next time.
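
Since the point of strict JSON is persistence, here is a minimal sketch of saving a generated question with Python's built-in sqlite3 module (the table name and schema are my own choices, not something settled in this project):

import json
import sqlite3

conn = sqlite3.connect("questions.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS questions (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           subject TEXT,
           payload TEXT  -- the full question JSON, stored verbatim
       )"""
)

# `question` would be the parsed dict returned by the JSON generator above.
question = {"question": "...", "choices": []}
conn.execute(
    "INSERT INTO questions (subject, payload) VALUES (?, ?)",
    ("diabetes", json.dumps(question)),
)
conn.commit()
conn.close()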

In this session, we embarked on an exciting journey to enhance our knowledge retention while preparing for the United States Medical Licensing Examination (USMLE). We began by dissecting the structure and components of multiple-choice questions (MCQs), which are crucial for the exam. We identified the essential elements of MCQs (stems, lead-ins, options, and distractors) along with the varying levels of difficulty that characterize USMLE questions, and we saw how statistical methods such as Item Response Theory (IRT) are used to calibrate question difficulty. Our initial efforts yielded a structured JSON output that aligns with our goal of building an MCQ database; next time we will work on lengthening the clinical vignettes through prompt engineering.

Hey everyone, let's keep teaching and learning!
