Lab 1: The Naive AgentΒΆ
Duration: ~20 minutes
What You'll Build
In this lab you'll explore a naive AI agent that helps doctors manage their patient portal inbox. You'll learn the core agent loop, see how LangGraph and Structured Outputs work together, and run the agent against real (synthetic) patient data.
By the end you'll have a working agent, a clear sense of what it does well, and a list of problems you'd want to fix before putting this anywhere near a real clinic.
The ProblemΒΆ
Since the introduction of Electronic Health Record (EHR) portals, doctors are overwhelmed with messages from their patients. A single message might contain several unrelated questions. Some are urgent; most are routine. Responding to all of them comes on top of a full patient load β and that means keeping up is exhausting.
When patients don't get responses, important medical needs can go unaddressed.
A real inbox backlog
Dr. Sarah Kim at Lakeview Family Medicine has 12 patients in the workshop dataset. Among them there are 10 unresolved portal messages β including an urgent message from a patient whose pharmacy is refusing to fill their warfarin prescription, and a follow-up complaint that nobody has responded.
How should Dr. Kim prioritize? Which messages need attention today, and which can wait? What patient history is relevant to each message?
The QuestionΒΆ
Can we use an AI agent to support doctor-patient communication outside of appointments in a way that:
- Preserves the doctor-patient relationship β the doctor stays in the loop
- Keeps the doctor as the expert β the agent surfaces information, it doesn't make medical decisions
- Reduces cognitive load β the agent organizes and prioritizes, so the doctor can focus on the medicine
What we are NOT building
We are not building an agent that drafts responses for the doctor, makes diagnoses, or acts autonomously on patient care. That would be the Gmail "draft it for you" anti-pattern β tempting, but dangerous in a domain where the human must remain the expert.
Instead, we want the agent to surface and organize information so the doctor can act on it efficiently. Think "intelligent inbox triage," not "AI doctor."
The Agent Loop: Observe, Reason, ActΒΆ
Before we build anything, let's understand the core pattern behind every AI agent.
The community has converged on ReAct (Yao et al., 2023) as the standard agent architecture. It's a loop with three steps:
flowchart LR
O[Observe] --> R[Reason]
R --> A[Act]
A --> O | Step | What happens | In our case |
|---|---|---|
| Observe | The agent takes in new information β a query, updated data, tool results | A new patient message arrives, or the agent reads a patient record |
| Reason | The LLM decides what to do next β call a tool, ask for more info, or respond | "This message mentions warfarin β I should look up the patient's medication list and recent labs" |
| Act | The agent executes β calls a tool, writes to memory, or produces output | Calls get_patient_record(), reads the result, then summarizes findings |
The loop repeats until the agent decides it has enough information to produce a final response.
Chat agents vs. background agentsΒΆ
Most people encounter agents as chat agents β you type a message, the agent reasons and responds, you type another message. The user drives each turn.
But there's another pattern that's often a better fit for real systems: the background agent.
| Chat agent | Background agent | |
|---|---|---|
| Trigger | User sends a message | New data arrives (a patient message, a database update) |
| Session | User-driven, multi-turn | Data-driven, often single-turn |
| Ending | User stops chatting | Agent decides it's done |
| UI | Chat window | Dashboard, inbox, notifications |
For our doctor inbox problem, a background agent is the better design: incoming patient messages trigger the agent, it processes them and updates a structured inbox, and the doctor interacts with the inbox UI β not with a chat window.
Why not chat?
A chat interface is risky here because you can't stop doctors from asking the agent to draft responses, make diagnoses, or confirm medical correctness. An inbox UI keeps the agent's role constrained to organizing information, preserving the doctor's role as the expert.
The DataΒΆ
The workshop includes a set of synthetic EHR patient records in the data/ directory. These simulate Lakeview Family Medicine, a small GP practice with three providers and 12 patients.
Each patient file (data/patients/patient_001.json through patient_012.json) contains:
| Section | Contents |
|---|---|
demographics | Name, DOB, contact info, insurance, preferred language |
socialHistory | Smoking, alcohol, exercise, lifestyle notes |
familyHistory | Family medical conditions |
allergies | Known allergies with reactions and criticality |
conditions | Active/resolved conditions with ICD-10 codes |
medications | Current medications with RxNorm codes, dosages, prescribers |
immunizations | Vaccination records with CVX codes |
encounters | Office visit notes in SOAP format |
labs | Lab results with LOINC codes, values, reference ranges |
messages | Patient portal messages with threading and priority |
How this data was generated
The synthetic data was generated using a structured pipeline designed to avoid common LLM pitfalls β see data/README.md for the full methodology. Key points:
- A diversity matrix (
data/patient_specs.json) was defined before generation to ensure varied demographics, conditions, communication styles, and clinical archetypes - Demographics were deliberately decoupled from clinical attributes β names and implied ethnicity do not predict occupation, condition, or communication style
- Each patient was generated by an isolated subagent to prevent cross-patient pattern reuse
- All medical codes (ICD-10, LOINC, RxNorm, CVX) are real
Learning ObjectivesΒΆ
By the end of this lab, you will:
- Understand the ReAct loop (Observe β Reason β Act) and how it maps to LangGraph's
create_react_agent - Know the difference between an agent that surfaces information and one that generates content β and why it matters in healthcare
- Understand how Structured Outputs and constrained decoding eliminate JSON parsing headaches
- Run the agent against synthetic patient data and examine its output
- Identify the limitations of a naive implementation
Step 1: Explore the EHR ViewerΒΆ
Before we look at any agent code, let's get oriented with the application.
Start the EHR data viewer (you don't need an API key for this part):
Open http://localhost:8501 in your browser.
This is Dr. Kim's inbox dashboard. Take a minute to explore:
- Patient selector (top) β switch between patients. The emoji shows inbox status.
- Medical record (left) β conditions, medications, labs, encounter history. Click the tabs.
- Concerns panel (right) β empty for now. This is where the agent's output will appear.
- Inbox (bottom left) β patient portal messages, newest first. Click one to see the full conversation.
Look at the inbox
Browse a few patients and their messages. Notice how some messages touch on multiple topics β a patient might ask about a medication refill and report a new symptom in the same message.
Now imagine you're Dr. Kim with a full day of appointments. How would you decide which messages need attention first?
Step 2: Understand the Agent CodeΒΆ
Now let's look at how the agent works. The code lives in lab1/agent/.
Tools: how the agent reads patient dataΒΆ
Open lab1/agent/tools.py. Each function decorated with @tool becomes something the LLM can call:
@tool
def get_patient_record(patient_id: str) -> dict:
"""Get a patient's full record: demographics, conditions, allergies,
medications, lab results, encounter history, messages, and social history."""
resp = requests.get(f"{API_URL}/patients/{patient_id}")
resp.raise_for_status()
return resp.json()
The @tool decorator does three things automatically:
- Registers the function as a callable tool
- Generates a JSON schema from the type hints and docstring
- Makes it available to the LLM during the ReAct loop
The agent has five tools: list_patients, get_patient_record, get_messages, search_labs, and get_inbox.
Think about this
Notice that the agent can call list_patients() or get_patient_record() for any patient β not just the one it was asked to review. What could go wrong?
The agent: LangGraph's ReAct loopΒΆ
Open lab1/agent/agent.py. The core is surprisingly short:
def _build_agent():
llm = get_chat_model()
return create_react_agent(
model=llm,
tools=ALL_TOOLS,
prompt=SYSTEM_PROMPT,
response_format=PatientConcerns,
)
create_react_agent builds the full ReAct loop for us:
- Send the system prompt + user message to the LLM
- If the LLM returns tool calls β execute them, feed results back, go to 1
- If the LLM is done β make one final call with Structured Outputs to produce a
PatientConcernsobject
That last point is key: response_format=PatientConcerns tells the API to use Structured Outputs. Let's unpack what that means.
Structured Outputs and constrained decodingΒΆ
LLMs generate text one token at a time. Normally, every token in the vocabulary is a candidate at each step. Constrained decoding changes this: before each token is sampled, the API masks out every token that would make the output invalid according to your schema. The model literally cannot produce malformed JSON β it's not a "try and retry" approach, it's a hard constraint on generation.
When you pass response_format=PatientConcerns (a Pydantic model), the API converts it to a JSON schema and enforces it during generation. This gives you:
Correctness for free. Every field will be present, every enum value will be valid, every list will be a list. No parsing code, no fallback logic, no stripping markdown code fences from LLM output that decided to be "helpful."
Fewer tokens, lower cost. Without structured outputs, you'd need to describe the exact JSON format you want in the prompt β field names, types, enum values, examples. That's easily 200-400 extra tokens of instructions on every call, and the model might still get it wrong, requiring a retry (which doubles your cost). Constrained decoding moves all of that into the schema, so you don't pay for it in prompt tokens or retry tokens.
Background agents need this. In a chat interface, a human can look at malformed output and ask "try again." A background agent has no human in the loop β if the output doesn't parse, the pipeline fails silently or crashes. Structured Outputs guarantee that every agent run produces a valid PatientConcerns object, even at 3 AM with no one watching.
How is this different from asking nicely?
You might have seen prompts that say "respond in JSON with these fieldsβ¦" β that's prompting for structure. It works most of the time, but the model can always surprise you with a markdown wrapper, a missing field, or a creative reinterpretation of your enum values.
Structured Outputs are fundamentally different: the constraint is enforced at the token level during generation. It's the difference between asking someone to drive the speed limit and installing a speed governor on the engine.
The output contractΒΆ
Open lab1/agent/models.py. The Concern model defines what the agent produces:
class Concern(BaseModel):
id: str
patient_id: str
title: str
summary: str # one sentence
action: str # what the doctor should do
concern_type: ConcernType # medication, lab_result, symptom, follow_up, administrative
urgency: Urgency # routine, soon, urgent
status: ConcernStatus # unresolved, monitoring, resolved
evidence: list[str] # specific values and dates
related: RelatedData # links back to messages, labs, conditions, encounters
This is the contract between the agent and the UI. The agent fills it in; the UI renders it. The doctor never sees raw LLM output.
Step 3: Run the AgentΒΆ
Make sure you've configured your LLM provider in .env (see Prerequisites).
Start the agent API in a second terminal (the UI from Step 1 should still be running):
Now go back to the UI at http://localhost:8501:
- Select a patient from the dropdown
- Click Run Agent in the Concerns panel
- Wait for the agent to finish (the button will show "Agent Runningβ¦")
- Examine the concerns that appear
The agent will call tools to explore the patient's record, then produce structured concerns with urgency levels, evidence, and recommended actions.
Try several patients
Run the agent on 2-3 different patients. Notice how:
- Concerns are sorted by urgency (π΄ urgent β π‘ soon β π΅ routine)
- Each concern has a specific action for the doctor
- The Related links let you jump to the relevant message, lab, or encounter
- The patient dropdown updates with urgency indicators
Step 4: Evaluate the OutputΒΆ
Now the important part. Run the agent on a few patients and critically evaluate what it produces.
Write down 2-3 things
For each patient you review, write down:
- Something the agent got right β a concern that's genuinely useful, well-evidenced, and actionable
- Something the agent got wrong β a concern that's misleading, vague, or incorrect
- Something that's missing β a real issue in the patient's record that the agent didn't surface
Keep these notes. We'll use them throughout the remaining labs.
What's WorkingΒΆ
Let's acknowledge what this naive agent already does well:
The ReAct loop works. The agent autonomously decides which tools to call and in what order. It doesn't follow a hardcoded pipeline β it investigates based on what it finds. LangGraph handles the loop mechanics so we can focus on the prompt and tools.
Structured Outputs eliminate parsing problems. Constrained decoding guarantees valid JSON matching our Pydantic schema. No retries on malformed output, no code-fence stripping, no manual field mapping. This reduces both latency and cost.
The UI keeps the doctor in control. The doctor clicks a button, reviews structured output, and acts on it. There's no chat box where they might ask the agent to draft a reply or confirm a diagnosis. The agent's role is constrained by the interface itself.
Background processing is the right pattern. The agent runs in the background and writes to a store. The doctor sees results when they're ready β they don't have to sit in a conversation and wait for each response.
What's BrokenΒΆ
But this agent has serious problems. Some you probably noticed in your evaluation:
π No access controlsΒΆ
The agent can access any patient's data at any time. When you ask it to review patient-001, nothing stops it from calling get_patient_record("patient-007"). In a real system, this would violate HIPAA's minimum necessary standard β the agent should only see data relevant to its current task.
π² Concerns aren't stableΒΆ
Run the agent twice on the same patient. You'll likely get different concerns β different titles, different urgency levels, maybe different issues entirely. The agent overwrites its previous output on every run. There's no persistence, no diffing, no way to track how concerns change over time.
π€₯ No hallucination checksΒΆ
The agent might report a lab value that doesn't exist, misattribute a symptom, or fabricate evidence. There's nothing in place to verify that the agent's output actually matches the patient record. We're trusting the LLM to be accurate β and it won't always be.
π¨ββοΈ The agent overstepsΒΆ
Read the system prompt carefully. Despite explicit instructions to "not make clinical recommendations," the agent tends to:
- Suggest diagnoses ("possible hypothyroidism")
- Recommend treatments ("consider starting levothyroxine")
- Editorialize on urgency in ways that could bias the doctor
The LLM wants to be helpful. In healthcare, "helpful" can be dangerous.
π No completeness checksΒΆ
How do you know the agent found all the concerns? It might surface 3 out of 5 real issues and you'd never know. There's no mechanism to verify coverage β no comparison against the actual record, no checklist, no second opinion.
Up NextΒΆ
These problems aren't just academic β they're the kind of issues that would stop a real healthcare system from deploying this agent.
In the remaining labs, we'll fix them:
| Lab | Problem | Solution |
|---|---|---|
| Lab 2 | No visibility into what the agent is doing | Observability: tracing, logging, cost tracking |
| Lab 3 | Unstable output, hallucinations, overstepping | Evaluation: output validation, grounding checks, guardrails |
| Lab 4 | Unrestricted data access | Security: scoped tools, access controls, audit trails |
Keep your notes
The issues you wrote down in Step 4 are your personal roadmap for the next three labs. As we add observability, evaluation, and security, check whether each improvement addresses something you noticed.