LLM FHIR Eval
@flexpa/llm-fhir-eval is a benchmark evaluation framework for Large Language Models (LLMs) on FHIR-specific tasks.
Built with FHIR tooling and open-source frameworks.
Extraction Benchmark
In the extraction benchmark, an LLM is given a pair of inputs consisting of a FHIR resource and a question. The benchmark evaluates its accuracy (pass@1) in correctly extracting the answer to the question from the resource.
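Conceptually, each extraction case pairs a resource and a question with a single expected answer, and a completion passes only if it matches that answer. Below is a minimal sketch of that shape and check, assuming a simple exact-match comparison (which the prompts' strict output-format rules suggest); the type and function names are illustrative, not the framework's actual API.

```typescript
// Illustrative sketch only: these names are hypothetical, not the
// actual @flexpa/llm-fhir-eval API.
interface ExtractionTask {
  resource: Record<string, unknown>; // the FHIR resource under test
  question: string;                  // e.g. "What is the patient's heart rate?"
  expected: string;                  // the single expected answer string
}

// pass@1: one completion is sampled per task and it passes only if it
// matches the expected answer exactly (after trimming whitespace).
function passAt1(task: ExtractionTask, completion: string): boolean {
  return completion.trim() === task.expected.trim();
}
```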
Two prompts are used in this benchmark to evaluate the effectiveness of prompt engineering techniques for this task. In both cases, the model is given a zero-shot instruction with no examples.
Minimalist Prompt
A terse, zero-shot instruction with no examples, no schema, and minimal system guidance.
Extract the answer to the question from the FHIR resource.
<fhir-resource>
{{resource}}
</fhir-resource>
<question>
{{question}}
</question>
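The {{resource}} and {{question}} placeholders are filled with the serialized FHIR resource and the question text before the prompt is sent to the model. A minimal interpolation sketch (the renderPrompt helper below is hypothetical):

```typescript
// Hypothetical helper: shows how the prompt placeholders get filled in.
const MINIMALIST_PROMPT = `Extract the answer to the question from the FHIR resource.
<fhir-resource>
{{resource}}
</fhir-resource>
<question>
{{question}}
</question>`;

function renderPrompt(template: string, resource: object, question: string): string {
  return template
    .replace("{{resource}}", JSON.stringify(resource, null, 2))
    .replace("{{question}}", question);
}
```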
Specialist Prompt
A domain-expert prompt that supplies detailed system instructions. It explicitly reminds the model to output only the specific answer.
You are a FHIR data extraction specialist.
Given a FHIR resource and a question, extract the requested information.
Return only the specific answer without explanation.
If the question cannot be answered with the information provided, return "N/A".
Do not infer or make assumptions.
When the question is about a specific value, return the value only.
When the value exists literally in the FHIR resource, return the value only.
If a unit is specified, return the value with unit, in the normally expected format.
Do not return extra text or formatting including unnecessary quotes around strings.
Do not append or prepend any newlines.
<fhir-resource>
{{resource}}
</fhir-resource>
<question>
{{question}}
</question>
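As a concrete illustration of these rules, a question about a quantity should yield the bare value with its unit and nothing else. The Observation fragment and expected answer below are made up for the example, not benchmark fixtures:

```typescript
// Illustrative only: an abbreviated Observation ...
const observation = {
  resourceType: "Observation",
  code: { text: "Heart rate" },
  valueQuantity: { value: 72, unit: "beats/minute" },
};
// ... for which the question "What is the heart rate?" should produce
// the bare value with its unit, with no quotes and no extra newlines:
const expectedAnswer = "72 beats/minute";
```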
Results Summary
Complete Results (Pass@1)
Pass@1 accuracy for every evaluated model & prompt combination.
Model | Prompt | Score | Accuracy | Pass | Fail | Latency (s) | Cost ($) |
---|---|---|---|---|---|---|---|
openai-o3-low | Specialist | 72.0/72 | 100.0% | 72 | 0 | 3.1 | 33.4209 |
anthropic-claude-3-5-sonnet-20241022 | Specialist | 71.0/72 | 98.6% | 71 | 1 | 2.7 | 2.8001 |
anthropic-claude-sonnet-4-20250514 | Specialist | 71.0/72 | 98.6% | 71 | 1 | 3.0 | 13.9308 |
openai-o3-high | Specialist | 70.0/72 | 97.2% | 70 | 2 | 5.1 | 33.9498 |
claude-3-5-haiku-20241022 | Specialist | 70.0/72 | 97.2% | 70 | 2 | 3.8 | 0.9340 |
anthropic-claude-opus-4-20250514 | Specialist | 69.0/72 | 95.8% | 69 | 3 | 4.5 | 69.6569 |
google-gemini-2.5-flash-preview-05-20 | Specialist | 69.0/72 | 95.8% | 69 | 3 | 2.9 | 0.6279 |
google-gemini-2.5-pro-preview-05-06 | Specialist | 69.0/72 | 95.8% | 69 | 3 | 12.0 | 10.9536 |
openai-gpt-4.1 | Specialist | 65.0/72 | 90.3% | 65 | 7 | 2.0 | 1.6637 |
openai-o3-high | Minimalist | 65.0/72 | 90.3% | 65 | 7 | 6.5 | 33.5414 |
google-gemini-2.0-flash | Minimalist | 65.0/72 | 90.3% | 65 | 7 | 0.8 | 0.3994 |
google-gemini-2.5-pro-preview-05-06 | Minimalist | 62.0/72 | 86.1% | 62 | 10 | 10.3 | 10.7403 |
google-gemini-2.0-flash | Specialist | 59.0/72 | 81.9% | 59 | 13 | 0.7 | 0.4028 |
openai-o3-low | Minimalist | 58.0/72 | 80.6% | 58 | 14 | 3.6 | 33.0866 |
google-gemini-2.5-flash-preview-05-20 | Minimalist | 56.0/72 | 77.8% | 56 | 16 | 3.7 | 0.6309 |
claude-3-5-haiku-20241022 | Minimalist | 53.0/72 | 73.6% | 53 | 19 | 3.9 | 0.9368 |
openai-gpt-3.5-turbo | Specialist | 53.0/82 | 64.6% | 53 | 19 | 0.5 | 0.0411 |
medgemma-27b-text-it | Specialist | 53.0/82 | 64.6% | 53 | 19 | 1.0 | 0.0000 |
ii-medical-8b | Specialist | 51.0/82 | 62.2% | 51 | 21 | 22.0 | 0.0000 |
medgemma-4b-it | Specialist | 50.0/82 | 61.0% | 50 | 22 | 1.6 | 0.0000 |
medgemma-4b-it | Minimalist | 44.0/82 | 53.7% | 44 | 28 | 1.8 | 0.0000 |
medgemma-27b-text-it | Minimalist | 42.0/82 | 51.2% | 42 | 30 | 3.6 | 0.0000 |
anthropic-claude-3-5-sonnet-20241022 | Minimalist | 27.0/72 | 37.5% | 27 | 45 | 3.3 | 2.8118 |
openai-gpt-4.1 | Minimalist | 26.0/72 | 36.1% | 26 | 46 | 3.7 | 1.7046 |
anthropic-claude-sonnet-4-20250514 | Minimalist | 26.0/72 | 36.1% | 26 | 46 | 4.4 | 13.8970 |
anthropic-claude-opus-4-20250514 | Minimalist | 26.0/72 | 36.1% | 26 | 46 | 6.5 | 69.4046 |
openai-gpt-3.5-turbo | Minimalist | 21.0/82 | 25.6% | 21 | 51 | 0.6 | 0.0388 |
ii-medical-8b | Minimalist | 20.0/82 | 24.4% | 20 | 52 | 23.0 | 0.0000 |
Generation Benchmark
In the generation benchmark, an LLM is given an unstructured clinical note and must generate a valid FHIR resource. The benchmark evaluates the model's ability to structure clinical data into standardized FHIR format.
Two approaches are compared: zero-shot generation versus multi-shot with tool use, demonstrating the impact of access to the FHIR $validate operation on generation quality.
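$validate is a standard FHIR operation that returns an OperationOutcome describing any problems with a resource. A minimal sketch of invoking it for a generated Bundle (the server base URL is a placeholder, and the benchmark's actual validation setup may differ):

```typescript
// Minimal sketch: POST a generated Bundle to a FHIR server's $validate operation.
// The base URL is a placeholder; swap in whatever R4 server you validate against.
async function validateBundle(bundle: object): Promise<boolean> {
  const response = await fetch("https://fhir.example.org/r4/Bundle/$validate", {
    method: "POST",
    headers: { "Content-Type": "application/fhir+json" },
    body: JSON.stringify(bundle),
  });
  // $validate returns an OperationOutcome; issues with severity "error"
  // or "fatal" mean the Bundle did not validate.
  const outcome = await response.json();
  return !(outcome.issue ?? []).some(
    (issue: { severity: string }) =>
      issue.severity === "error" || issue.severity === "fatal",
  );
}
```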
Zero-Shot Generation
A direct instruction to convert clinical notes to FHIR, with no examples or structured guidance, testing the model's inherent understanding of FHIR structure.
You are a health informaticist expert in FHIR.
You will receive unstructured notes and you need to structure them into FHIR resources.
You must only include data that is present in the note.
You must only return a valid FHIR JSON Bundle, with the appropriate resources, with no additional explanation.
You may include multiple resources in the bundle.
You must follow the FHIR R4 specification.
You must not include a meta element in the resources.
When generating a CodeableConcept, you must include a coding element with a system, code, and display.
When generating a CodeableConcept, you must use a display matching what is expected by the CodeSystem.
Each entry in a Bundle must have a fullUrl which is the identity of the resource in the entry.
The id of a resource must be a valid UUID in lowercase.
You must only return JSON with no additional markup or explanation.
<note>
{{note}}
</note>
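For illustration, a Bundle entry following these rules looks roughly like the fragment below. The codes, UUIDs, and clinical content are made up for the example:

```typescript
// Illustrative fragment of the Bundle shape the prompt asks for:
// each entry carries a fullUrl, resource ids are lowercase UUIDs,
// and CodeableConcepts include a coded system/code/display.
const exampleBundle = {
  resourceType: "Bundle",
  type: "collection",
  entry: [
    {
      fullUrl: "urn:uuid:0f3a6f1e-8c2b-4f08-9a51-2d5f4f6b7c8d",
      resource: {
        resourceType: "Condition",
        id: "0f3a6f1e-8c2b-4f08-9a51-2d5f4f6b7c8d",
        code: {
          coding: [
            {
              system: "http://snomed.info/sct",
              code: "38341003",
              display: "Hypertensive disorder, systemic arterial (disorder)",
            },
          ],
        },
        // subject would reference a Patient entry elsewhere in the Bundle
        subject: { reference: "urn:uuid:5e8d2c1a-9b4f-4e3d-8a7c-6f1b2d3e4a5c" },
      },
    },
  ],
};
```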
Multi-Shot + Tool Use
Provides multiple examples and access to a FHIR validation function. Models can call the $validate operation up to 10 times to iteratively improve their FHIR Bundle before submitting the final result.
Validation Process:
- Model generates initial FHIR Bundle from clinical note
- Calls the validate_fhir_bundle() function
- Receives validation errors and warnings from the FHIR server
- Iteratively fixes issues and re-validates (up to 10 attempts)
- Returns final validated Bundle
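A sketch of that loop, with the model call and the $validate-backed tool left as placeholder callbacks (the names and signatures below are assumptions, not the framework's API):

```typescript
// Sketch of the multi-shot tool-use loop; the callbacks are placeholders
// for whatever the harness actually wires in.
interface Issue {
  severity: "fatal" | "error" | "warning" | "information";
  diagnostics?: string;
}
interface OperationOutcome {
  issue?: Issue[];
}

async function generateWithValidation(
  note: string,
  askModel: (input: { note: string; bundle?: object; errors?: Issue[] }) => Promise<object>,
  validateFhirBundle: (bundle: object) => Promise<OperationOutcome>,
  maxAttempts = 10,
): Promise<object> {
  // 1. Initial Bundle generated from the clinical note alone.
  let bundle = await askModel({ note });
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // 2-3. Validate and collect error/fatal issues from the FHIR server.
    const outcome = await validateFhirBundle(bundle);
    const errors = (outcome.issue ?? []).filter(
      (i) => i.severity === "error" || i.severity === "fatal",
    );
    if (errors.length === 0) break; // valid: stop iterating
    // 4. Feed the issues back to the model and let it repair the Bundle.
    bundle = await askModel({ note, bundle, errors });
  }
  // 5. Return the final Bundle (validated if the loop converged).
  return bundle;
}
```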
Results Summary
Complete Results (Pass@1)
Pass@1 accuracy for every evaluated model & generation approach combination.
Model | Prompt | Score | Accuracy | Pass | Fail | Latency (s) | Cost ($) |
---|---|---|---|---|---|---|---|
anthropic-claude-sonnet-4-20250514 | Tool Use | 14.0/14 | 100.0% | 14 | 0 | 32.4 | 1.3101 |
anthropic-claude-opus-4-20250514 | Tool Use | 13.6/14 | 85.7% | 12 | 2 | 41.6 | 7.7170 |
openai-gpt-4.1 | Tool Use | 13.4/14 | 78.6% | 11 | 3 | 26.9 | 0.2678 |
openai-o3-high | Tool Use | 13.4/14 | 78.6% | 11 | 3 | 69.7 | 4.8258 |
anthropic-claude-3-5-sonnet-20241022 | Tool Use | 12.8/14 | 78.6% | 11 | 3 | 39.1 | 0.8450 |
openai-o3-low | Tool Use | 12.6/14 | 57.1% | 8 | 6 | 51.2 | 7.0484 |
anthropic-claude-3-5-haiku-20241022 | Tool Use | 10.0/14 | 50.0% | 7 | 7 | 55.0 | 0.4672 |
openai-o3-high | Zero Shot | 12.2/14 | 50.0% | 7 | 7 | 35.8 | 1.4541 |
anthropic-claude-sonnet-4-20250514 | Zero Shot | 12.4/14 | 50.0% | 7 | 7 | 9.6 | 0.2075 |
google-gemini-2.5-pro-preview-05-06 | Zero Shot | 12.4/14 | 50.0% | 7 | 7 | 35.0 | 0.5771 |
openai-gpt-4.1 | Zero Shot | 12.2/14 | 42.9% | 6 | 8 | 8.8 | 0.0757 |
openai-o3-low | Zero Shot | 11.6/14 | 42.9% | 6 | 8 | 14.6 | 0.6870 |
anthropic-claude-opus-4-20250514 | Zero Shot | 12.4/14 | 42.9% | 6 | 8 | 14.4 | 1.0696 |
anthropic-claude-3-5-sonnet-20241022 | Zero Shot | 12.0/14 | 35.7% | 5 | 9 | 9.0 | 0.1568 |
anthropic-claude-3-5-haiku-20241022 | Zero Shot | 10.2/14 | 28.6% | 4 | 10 | 7.9 | 0.0450 |
google-gemini-2.5-flash-preview-05-20 | Zero Shot | 11.6/14 | 28.6% | 4 | 10 | 11.3 | 0.0242 |
openai-gpt-3.5-turbo | Zero Shot | 9.6/14 | 14.3% | 2 | 12 | 3.8 | 0.0110 |
google-gemini-2.0-flash | Zero Shot | 10.4/14 | 14.3% | 2 | 12 | 4.2 | 0.0056 |
openai-gpt-3.5-turbo | Tool Use | 7.0/18 | 5.6% | 1 | 13 | 32.3 | 0.0547 |
ii-medical-8b | Zero Shot | 2.8/14 | 0.0% | 0 | 14 | 65.6 | 0.0000 |
medgemma-4b-it | Zero Shot | 6.8/14 | 0.0% | 0 | 14 | 22.5 | 0.0000 |
medgemma-27b-text-it | Zero Shot | 9.0/14 | 0.0% | 0 | 14 | 36.2 | 0.0000 |
Contribute to LLM FHIR Eval
Help us expand tasks, improve evaluation quality, and push the boundaries of healthcare LLM research. Every contribution counts!
Get Involved on GitHub