Open Source Benchmarking Framework

LLM FHIR Eval

@flexpa/llm-fhir-eval is a benchmark evaluation framework for Large Language Models (LLMs) on FHIR-specific tasks.

14 models evaluated · 2 tasks evaluated · 100% best extraction Pass@1 · 100% best generation Pass@1

Built with FHIR tooling and open source frameworks

Medplum · HAPI FHIR · FHIR Validator · Synthea

Extraction Benchmark

In the extraction benchmark, an LLM is given a pair of inputs consisting of a FHIR resource and a question. The benchmark evaluates its accuracy (pass@1) in correctly extracting the answer to the question from the resource.
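
For illustration, a single task in this benchmark might look like the following. This is a hypothetical item: the field names and the example resource are illustrative, not the framework's actual task schema.

// A hypothetical extraction task (illustrative only).
interface ExtractionTask {
  resource: object;  // any FHIR R4 resource, as JSON
  question: string;  // natural-language question about the resource
  expected: string;  // exact-match reference answer
}

const task: ExtractionTask = {
  resource: {
    resourceType: "Observation",
    status: "final",
    code: {
      coding: [{ system: "http://loinc.org", code: "8867-4", display: "Heart rate" }],
    },
    valueQuantity: { value: 72, unit: "beats/minute" },
  },
  question: "What is the recorded heart rate?",
  expected: "72 beats/minute",
};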

Two prompts are used in this benchmark to evaluate the effectiveness of prompt engineering for this task. In both cases the instruction is zero-shot: the model sees no worked examples.

Minimalist Prompt

A terse, single-sentence instruction. No examples, no schema, and minimal system guidance are provided.

Extract the answer to the question from the FHIR resource.

<fhir-resource>
{{resource}}
</fhir-resource>

<question>
{{question}}
</question>
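
Before the prompt is sent to a model, the {{resource}} and {{question}} placeholders are filled in with the task inputs. A minimal rendering sketch (the framework's own templating may differ):

const minimalistTemplate = `Extract the answer to the question from the FHIR resource.

<fhir-resource>
{{resource}}
</fhir-resource>

<question>
{{question}}
</question>`;

// Naive {{placeholder}} substitution.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => vars[key] ?? "");
}

const prompt = renderPrompt(minimalistTemplate, {
  resource: JSON.stringify({ resourceType: "Patient", id: "example" }, null, 2),
  question: "What is the resource type?",
});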

Specialist Prompt

A domain-expert prompt that supplies detailed system instructions. It explicitly reminds the model to output only the specific answer.

You are a FHIR data extraction specialist.
Given a FHIR resource and a question, extract the requested information.
Return only the specific answer without explanation.
If the question cannot be answered with the information provided, return "N/A".
Do not infer or make assumptions.
When the question is about a specific value, return the value only.
When the value exists literally in the FHIR resource, return the value only.
If a unit is specified, return the value with unit, in the normally expected format.
Do not return extra text or formatting including unnecessary quotes around strings.
Do not append or prepend any newlines.

<fhir-resource>
{{resource}}
</fhir-resource>

<question>
{{question}}
</question>
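
Because both prompts ask for a bare answer, responses can be scored by normalized exact match, and pass@1 is then the fraction of tasks answered correctly in a single attempt. A sketch of that scoring, assuming (not confirmed from the framework's source) that trimming whitespace and stripping surrounding quotes is the only normalization applied:

// Normalize a model answer: trim whitespace, strip surrounding quotes.
function normalize(answer: string): string {
  return answer.trim().replace(/^["']|["']$/g, "");
}

// Pass@1 with one attempt per task: the fraction of exact matches.
function passAt1(results: { output: string; expected: string }[]): number {
  const passed = results.filter(
    (r) => normalize(r.output) === normalize(r.expected)
  ).length;
  return passed / results.length;
}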

Results Summary

Pass@1 by prompt. Δ is the relative improvement of the Specialist prompt over the Minimalist prompt.

Model                 | Minimalist | Specialist | Δ
o3-low                | 80.6%      | 100.0%     | +24.1%
claude-3.5-sonnet     | 37.5%      | 98.6%      | +163.0%
claude-sonnet-4       | 36.1%      | 98.6%      | +173.1%
o3-high               | 90.3%      | 97.2%      | +7.7%
claude-3.5-haiku      | 73.6%      | 97.2%      | +32.1%
claude-opus-4         | 36.1%      | 95.8%      | +165.4%
gemini-2.5-flash      | 77.8%      | 95.8%      | +23.2%
gemini-2.5-pro        | 86.1%      | 95.8%      | +11.3%
gpt-4.1               | 36.1%      | 90.3%      | +150.0%
gemini-2.0-flash      | 90.3%      | 81.9%      | -9.2%
gpt-3.5-turbo         | 25.6%      | 64.6%      | +152.4%
medgemma-27b-text-it  | 51.2%      | 64.6%      | +26.2%
ii-medical-8b         | 24.4%      | 62.2%      | +155.0%
medgemma-4b-it        | 53.7%      | 61.0%      | +13.6%

Complete Results (Pass@1)

Pass@1 accuracy for every evaluated model & prompt combination.

Model                                  | Prompt     | Score   | Accuracy | Pass | Fail | Latency (s) | Cost ($)
openai-o3-low                          | Specialist | 72.0/72 | 100.0%   | 72   | 0    | 3.1         | 33.4209
anthropic-claude-3-5-sonnet-20241022   | Specialist | 71.0/72 | 98.6%    | 71   | 1    | 2.7         | 2.8001
anthropic-claude-sonnet-4-20250514     | Specialist | 71.0/72 | 98.6%    | 71   | 1    | 3.0         | 13.9308
openai-o3-high                         | Specialist | 70.0/72 | 97.2%    | 70   | 2    | 5.1         | 33.9498
claude-3-5-haiku-20241022              | Specialist | 70.0/72 | 97.2%    | 70   | 2    | 3.8         | 0.9340
anthropic-claude-opus-4-20250514       | Specialist | 69.0/72 | 95.8%    | 69   | 3    | 4.5         | 69.6569
google-gemini-2.5-flash-preview-05-20  | Specialist | 69.0/72 | 95.8%    | 69   | 3    | 2.9         | 0.6279
google-gemini-2.5-pro-preview-05-06    | Specialist | 69.0/72 | 95.8%    | 69   | 3    | 12.0        | 10.9536
openai-gpt-4.1                         | Specialist | 65.0/72 | 90.3%    | 65   | 7    | 2.0         | 1.6637
openai-o3-high                         | Minimalist | 65.0/72 | 90.3%    | 65   | 7    | 6.5         | 33.5414
google-gemini-2.0-flash                | Minimalist | 65.0/72 | 90.3%    | 65   | 7    | 0.8         | 0.3994
google-gemini-2.5-pro-preview-05-06    | Minimalist | 62.0/72 | 86.1%    | 62   | 10   | 10.3        | 10.7403
google-gemini-2.0-flash                | Specialist | 59.0/72 | 81.9%    | 59   | 13   | 0.7         | 0.4028
openai-o3-low                          | Minimalist | 58.0/72 | 80.6%    | 58   | 14   | 3.6         | 33.0866
google-gemini-2.5-flash-preview-05-20  | Minimalist | 56.0/72 | 77.8%    | 56   | 16   | 3.7         | 0.6309
claude-3-5-haiku-20241022              | Minimalist | 53.0/72 | 73.6%    | 53   | 19   | 3.9         | 0.9368
openai-gpt-3.5-turbo                   | Specialist | 53.0/82 | 64.6%    | 53   | 19   | 0.5         | 0.0411
medgemma-27b-text-it                   | Specialist | 53.0/82 | 64.6%    | 53   | 19   | 1.0         | 0.0000
ii-medical-8b                          | Specialist | 51.0/82 | 62.2%    | 51   | 21   | 22.0        | 0.0000
medgemma-4b-it                         | Specialist | 50.0/82 | 61.0%    | 50   | 22   | 1.6         | 0.0000
medgemma-4b-it                         | Minimalist | 44.0/82 | 53.7%    | 44   | 28   | 1.8         | 0.0000
medgemma-27b-text-it                   | Minimalist | 42.0/82 | 51.2%    | 42   | 30   | 3.6         | 0.0000
anthropic-claude-3-5-sonnet-20241022   | Minimalist | 27.0/72 | 37.5%    | 27   | 45   | 3.3         | 2.8118
openai-gpt-4.1                         | Minimalist | 26.0/72 | 36.1%    | 26   | 46   | 3.7         | 1.7046
anthropic-claude-sonnet-4-20250514     | Minimalist | 26.0/72 | 36.1%    | 26   | 46   | 4.4         | 13.8970
anthropic-claude-opus-4-20250514       | Minimalist | 26.0/72 | 36.1%    | 26   | 46   | 6.5         | 69.4046
openai-gpt-3.5-turbo                   | Minimalist | 21.0/82 | 25.6%    | 21   | 51   | 0.6         | 0.0388
ii-medical-8b                          | Minimalist | 20.0/82 | 24.4%    | 20   | 52   | 23.0        | 0.0000

Generation Benchmark

In the generation benchmark, an LLM is given an unstructured clinical note and must generate a valid FHIR resource. The benchmark evaluates the model's ability to structure clinical data into standardized FHIR format.

Two approaches are compared: zero-shot generation versus multi-shot with tool use, demonstrating the impact of access to the FHIR $validate operation on generation quality.
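
$validate is a standard FHIR REST operation: POSTing a resource to [base]/Bundle/$validate returns an OperationOutcome listing errors, warnings, and informational issues. A minimal client sketch (the base URL is a placeholder, not a framework-specific value):

// POST a generated Bundle to a FHIR server's $validate endpoint.
// Returns the OperationOutcome describing any errors or warnings.
async function validateBundle(baseUrl: string, bundle: object): Promise<any> {
  const response = await fetch(`${baseUrl}/Bundle/$validate`, {
    method: "POST",
    headers: { "Content-Type": "application/fhir+json" },
    body: JSON.stringify(bundle),
  });
  return response.json(); // an OperationOutcome resource
}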

Zero-Shot Generation

A direct instruction to convert clinical notes to FHIR, with no examples and no structured guidance. This tests the model's inherent understanding of FHIR structure.

You are a health informaticist expert in FHIR. 
You will receive unstructured notes and you need to structure them into FHIR resources.
You must only include data that is present in the note.
You must only return a valid FHIR JSON Bundle, with the appropriate resources, with no additional explanation.
You may include multiple resources in the bundle.
You must follow the FHIR R4 specification.
You must not include a meta element in the resources.
When generating a CodeableConcept, you must include a coding element with a system, code, and display.
When generating a CodeableConcept, you must use a display matching what is expected by the CodeSystem.
Each entry in a Bundle must have a fullUrl which is the identity of the resource in the entry.
The id of a resource must be a valid UUID in lowercase.

You must only return JSON with no additional markup or explanation.

<note>
{{note}}
</note>
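
A response satisfying these rules looks roughly like the fragment below. This is an illustrative shape only, not a benchmark reference answer; note the urn:uuid fullUrl on each entry, the lowercase UUID ids, and the fully populated coding element.

{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "urn:uuid:a9b4c1d2-3e4f-4a5b-8c6d-7e8f9a0b1c2d",
      "resource": {
        "resourceType": "Patient",
        "id": "a9b4c1d2-3e4f-4a5b-8c6d-7e8f9a0b1c2d",
        "name": [{ "family": "Example", "given": ["Pat"] }]
      }
    },
    {
      "fullUrl": "urn:uuid:b1c2d3e4-5f60-4a7b-9c8d-0e1f2a3b4c5d",
      "resource": {
        "resourceType": "Condition",
        "id": "b1c2d3e4-5f60-4a7b-9c8d-0e1f2a3b4c5d",
        "code": {
          "coding": [{
            "system": "http://snomed.info/sct",
            "code": "38341003",
            "display": "Hypertensive disorder, systemic arterial (disorder)"
          }]
        },
        "subject": { "reference": "urn:uuid:a9b4c1d2-3e4f-4a5b-8c6d-7e8f9a0b1c2d" }
      }
    }
  ]
}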

Multi-Shot + Tool Use

Provides multiple examples and access to a FHIR validation function. Models can call the $validate operation up to 10 times to iteratively improve their FHIR Bundle before submitting the final result.

Validation Process (sketched in code after the list):

  1. Model generates initial FHIR Bundle from clinical note
  2. Calls validate_fhir_bundle() function
  3. Receives validation errors and warnings from FHIR server
  4. Iteratively fixes issues and re-validates (up to 10 attempts)
  5. Returns final validated Bundle
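
In code, the loop looks roughly like the sketch below. generateBundle (which calls the model with the note and any prior validation feedback) and FHIR_BASE_URL are hypothetical stand-ins, and validateBundle is the $validate client sketched earlier; none of these names are confirmed from the framework's source.

// Hypothetical helpers, declared here so the sketch is self-contained.
declare function generateBundle(note: string, feedback?: string): Promise<object>;
declare function validateBundle(baseUrl: string, bundle: object): Promise<any>;
declare const FHIR_BASE_URL: string;

// Generate a Bundle, then validate and repair it, up to 10 attempts.
async function generateWithValidation(note: string): Promise<object> {
  let feedback: string | undefined;
  let bundle: object = {};
  for (let attempt = 0; attempt < 10; attempt++) {
    bundle = await generateBundle(note, feedback);
    const outcome = await validateBundle(FHIR_BASE_URL, bundle);
    // OperationOutcome.issue holds the validator's findings.
    const errors = (outcome.issue ?? []).filter(
      (i: { severity: string }) => i.severity === "error" || i.severity === "fatal"
    );
    if (errors.length === 0) return bundle; // validated: submit this Bundle
    feedback = JSON.stringify(errors);      // re-prompt with the errors
  }
  return bundle; // best effort after 10 attempts
}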

Results Summary

Pass@1 by approach. Δ is the relative improvement from adding multi-shot examples and tool use; a dash marks configurations that were not evaluated.

Model                 | Zero-Shot | Multi-Shot + Tool Use | Δ
claude-sonnet-4       | 50.0%     | 100.0%                | +100.0%
claude-opus-4         | 42.9%     | 85.7%                 | +100.0%
gpt-4.1               | 42.9%     | 78.6%                 | +83.3%
o3-high               | 50.0%     | 78.6%                 | +57.1%
claude-3.5-sonnet     | 35.7%     | 78.6%                 | +120.0%
o3-low                | 42.9%     | 57.1%                 | +33.3%
claude-3.5-haiku      | 28.6%     | 50.0%                 | +75.0%
gpt-3.5-turbo         | 14.3%     | 5.6%                  | -61.1%
gemini-2.0-flash      | 14.3%     | -                     | -
gemini-2.5-flash      | 28.6%     | -                     | -
gemini-2.5-pro        | 50.0%     | -                     | -
ii-medical-8b         | 0.0%      | -                     | -
medgemma-4b-it        | 0.0%      | -                     | -
medgemma-27b-text-it  | 0.0%      | -                     | -

Complete Results (Pass@1)

Pass@1 accuracy for every evaluated model & generation approach combination.

Model                                  | Prompt    | Score   | Accuracy | Pass | Fail | Latency (s) | Cost ($)
anthropic-claude-sonnet-4-20250514     | Tool Use  | 14.0/14 | 100.0%   | 14   | 0    | 32.4        | 1.3101
anthropic-claude-opus-4-20250514       | Tool Use  | 13.6/14 | 85.7%    | 12   | 2    | 41.6        | 7.7170
openai-gpt-4.1                         | Tool Use  | 13.4/14 | 78.6%    | 11   | 3    | 26.9        | 0.2678
openai-o3-high                         | Tool Use  | 13.4/14 | 78.6%    | 11   | 3    | 69.7        | 4.8258
anthropic-claude-3-5-sonnet-20241022   | Tool Use  | 12.8/14 | 78.6%    | 11   | 3    | 39.1        | 0.8450
openai-o3-low                          | Tool Use  | 12.6/14 | 57.1%    | 8    | 6    | 51.2        | 7.0484
anthropic-claude-3-5-haiku-20241022    | Tool Use  | 10.0/14 | 50.0%    | 7    | 7    | 55.0        | 0.4672
openai-o3-high                         | Zero Shot | 12.2/14 | 50.0%    | 7    | 7    | 35.8        | 1.4541
anthropic-claude-sonnet-4-20250514     | Zero Shot | 12.4/14 | 50.0%    | 7    | 7    | 9.6         | 0.2075
google-gemini-2.5-pro-preview-05-06    | Zero Shot | 12.4/14 | 50.0%    | 7    | 7    | 35.0        | 0.5771
openai-gpt-4.1                         | Zero Shot | 12.2/14 | 42.9%    | 6    | 8    | 8.8         | 0.0757
openai-o3-low                          | Zero Shot | 11.6/14 | 42.9%    | 6    | 8    | 14.6        | 0.6870
anthropic-claude-opus-4-20250514       | Zero Shot | 12.4/14 | 42.9%    | 6    | 8    | 14.4        | 1.0696
anthropic-claude-3-5-sonnet-20241022   | Zero Shot | 12.0/14 | 35.7%    | 5    | 9    | 9.0         | 0.1568
anthropic-claude-3-5-haiku-20241022    | Zero Shot | 10.2/14 | 28.6%    | 4    | 10   | 7.9         | 0.0450
google-gemini-2.5-flash-preview-05-20  | Zero Shot | 11.6/14 | 28.6%    | 4    | 10   | 11.3        | 0.0242
openai-gpt-3.5-turbo                   | Zero Shot | 9.6/14  | 14.3%    | 2    | 12   | 3.8         | 0.0110
google-gemini-2.0-flash                | Zero Shot | 10.4/14 | 14.3%    | 2    | 12   | 4.2         | 0.0056
openai-gpt-3.5-turbo                   | Tool Use  | 7.0/18  | 5.6%     | 1    | 13   | 32.3        | 0.0547
ii-medical-8b                          | Zero Shot | 2.8/14  | 0.0%     | 0    | 14   | 65.6        | 0.0000
medgemma-4b-it                         | Zero Shot | 6.8/14  | 0.0%     | 0    | 14   | 22.5        | 0.0000
medgemma-27b-text-it                   | Zero Shot | 9.0/14  | 0.0%     | 0    | 14   | 36.2        | 0.0000

Contribute to LLM FHIR Eval

Help us expand tasks, improve evaluation quality, and push the boundaries of healthcare LLM research. Every contribution counts!

Get Involved on GitHub