Open Source Benchmarking Framework

LLM FHIR Eval

@flexpa/llm-fhir-eval is a benchmark evaluation framework for Large Language Models (LLMs) on FHIR-specific tasks.

14 models evaluated · 2 tasks evaluated · 100% best extraction Pass@1 · 100% best generation Pass@1

Built with FHIR tooling and open source frameworks

Medplum · HAPI FHIR · FHIR Validator · Synthea

Extraction Benchmark

In the extraction benchmark, an LLM is given a pair of inputs consisting of a FHIR resource and a question. The benchmark evaluates its accuracy (pass@1) in correctly extracting the answer to the question from the resource.
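
For illustration, a single task in this benchmark might look like the following. This is a hypothetical item: the field names and the example resource are illustrative, not the framework's actual task schema.

// A hypothetical extraction task (illustrative only).
interface ExtractionTask {
  resource: object;  // any FHIR R4 resource, as JSON
  question: string;  // natural-language question about the resource
  expected: string;  // exact-match reference answer
}

const task: ExtractionTask = {
  resource: {
    resourceType: "Observation",
    status: "final",
    code: {
      coding: [{ system: "http://loinc.org", code: "8867-4", display: "Heart rate" }],
    },
    valueQuantity: { value: 72, unit: "beats/minute" },
  },
  question: "What is the recorded heart rate?",
  expected: "72 beats/minute",
};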

Two prompts are used in this benchmark to evaluate the effectiveness of prompt engineering for this task. In both cases the instruction is zero-shot: the model sees no worked examples.

Minimalist Prompt

A terse, single-sentence instruction. No examples, no schema, and minimal system guidance are provided.

Extract the answer to the question from the FHIR resource.

<fhir-resource>
{{resource}}
</fhir-resource>

<question>
{{question}}
</question>
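
Before the prompt is sent to a model, the {{resource}} and {{question}} placeholders are filled in with the task inputs. A minimal rendering sketch (the framework's own templating may differ):

const minimalistTemplate = `Extract the answer to the question from the FHIR resource.

<fhir-resource>
{{resource}}
</fhir-resource>

<question>
{{question}}
</question>`;

// Naive {{placeholder}} substitution.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => vars[key] ?? "");
}

const prompt = renderPrompt(minimalistTemplate, {
  resource: JSON.stringify({ resourceType: "Patient", id: "example" }, null, 2),
  question: "What is the resource type?",
});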

Specialist Prompt

A domain-expert prompt that supplies detailed system instructions. It explicitly reminds the model to output only the specific answer.

You are a FHIR data extraction specialist.
Given a FHIR resource and a question, extract the requested information.
Return only the specific answer without explanation.
If the question cannot be answered with the information provided, return "N/A".
Do not infer or make assumptions.
When the question is about a specific value, return the value only.
When the value exists literally in the FHIR resource, return the value only.
If a unit is specified, return the value with unit, in the normally expected format.
Do not return extra text or formatting including unnecessary quotes around strings.
Do not append or prepend any newlines.

<fhir-resource>
{{resource}}
</fhir-resource>

<question>
{{question}}
</question>
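
Because both prompts ask for a bare answer, responses can be scored by normalized exact match, and pass@1 is then the fraction of tasks answered correctly in a single attempt. A sketch of that scoring, assuming (not confirmed from the framework's source) that trimming whitespace and stripping surrounding quotes is the only normalization applied:

// Normalize a model answer: trim whitespace, strip surrounding quotes.
function normalize(answer: string): string {
  return answer.trim().replace(/^["']|["']$/g, "");
}

// Pass@1 with one attempt per task: the fraction of exact matches.
function passAt1(results: { output: string; expected: string }[]): number {
  const passed = results.filter(
    (r) => normalize(r.output) === normalize(r.expected)
  ).length;
  return passed / results.length;
}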

Results Summary

Pass@1 by prompt. Δ is the relative improvement of the Specialist prompt over the Minimalist prompt.

Model                 | Minimalist | Specialist | Δ
o3-low                | 80.6%      | 100.0%     | +24.1%
claude-3.5-sonnet     | 37.5%      | 98.6%      | +163.0%
claude-sonnet-4       | 36.1%      | 98.6%      | +173.1%
o3-high               | 90.3%      | 97.2%      | +7.7%
claude-3.5-haiku      | 73.6%      | 97.2%      | +32.1%
claude-opus-4         | 36.1%      | 95.8%      | +165.4%
gemini-2.5-flash      | 77.8%      | 95.8%      | +23.2%
gemini-2.5-pro        | 86.1%      | 95.8%      | +11.3%
gpt-4.1               | 36.1%      | 90.3%      | +150.0%
gemini-2.0-flash      | 90.3%      | 81.9%      | -9.2%
gpt-3.5-turbo         | 25.6%      | 64.6%      | +152.4%
medgemma-27b-text-it  | 51.2%      | 64.6%      | +26.2%
ii-medical-8b         | 24.4%      | 62.2%      | +155.0%
medgemma-4b-it        | 53.7%      | 61.0%      | +13.6%

Complete Results (Pass@1)

Pass@1 accuracy for every evaluated model & prompt combination.

Model                                  | Prompt     | Score   | Accuracy | Pass | Fail | Latency (s) | Cost ($)
openai-o3-low                          | Specialist | 72.0/72 | 100.0%   | 72   | 0    | 3.1         | 33.4209
anthropic-claude-3-5-sonnet-20241022   | Specialist | 71.0/72 | 98.6%    | 71   | 1    | 2.7         | 2.8001
anthropic-claude-sonnet-4-20250514     | Specialist | 71.0/72 | 98.6%    | 71   | 1    | 3.0         | 13.9308
openai-o3-high                         | Specialist | 70.0/72 | 97.2%    | 70   | 2    | 5.1         | 33.9498
claude-3-5-haiku-20241022              | Specialist | 70.0/72 | 97.2%    | 70   | 2    | 3.8         | 0.9340
anthropic-claude-opus-4-20250514       | Specialist | 69.0/72 | 95.8%    | 69   | 3    | 4.5         | 69.6569
google-gemini-2.5-flash-preview-05-20  | Specialist | 69.0/72 | 95.8%    | 69   | 3    | 2.9         | 0.6279
google-gemini-2.5-pro-preview-05-06    | Specialist | 69.0/72 | 95.8%    | 69   | 3    | 12.0        | 10.9536
openai-gpt-4.1                         | Specialist | 65.0/72 | 90.3%    | 65   | 7    | 2.0         | 1.6637
openai-o3-high                         | Minimalist | 65.0/72 | 90.3%    | 65   | 7    | 6.5         | 33.5414
google-gemini-2.0-flash                | Minimalist | 65.0/72 | 90.3%    | 65   | 7    | 0.8         | 0.3994
google-gemini-2.5-pro-preview-05-06    | Minimalist | 62.0/72 | 86.1%    | 62   | 10   | 10.3        | 10.7403
google-gemini-2.0-flash                | Specialist | 59.0/72 | 81.9%    | 59   | 13   | 0.7         | 0.4028
openai-o3-low                          | Minimalist | 58.0/72 | 80.6%    | 58   | 14   | 3.6         | 33.0866
google-gemini-2.5-flash-preview-05-20  | Minimalist | 56.0/72 | 77.8%    | 56   | 16   | 3.7         | 0.6309
claude-3-5-haiku-20241022              | Minimalist | 53.0/72 | 73.6%    | 53   | 19   | 3.9         | 0.9368
openai-gpt-3.5-turbo                   | Specialist | 53.0/82 | 64.6%    | 53   | 19   | 0.5         | 0.0411
medgemma-27b-text-it                   | Specialist | 53.0/82 | 64.6%    | 53   | 19   | 1.0         | 0.0000
ii-medical-8b                          | Specialist | 51.0/82 | 62.2%    | 51   | 21   | 22.0        | 0.0000
medgemma-4b-it                         | Specialist | 50.0/82 | 61.0%    | 50   | 22   | 1.6         | 0.0000
medgemma-4b-it                         | Minimalist | 44.0/82 | 53.7%    | 44   | 28   | 1.8         | 0.0000
medgemma-27b-text-it                   | Minimalist | 42.0/82 | 51.2%    | 42   | 30   | 3.6         | 0.0000
anthropic-claude-3-5-sonnet-20241022   | Minimalist | 27.0/72 | 37.5%    | 27   | 45   | 3.3         | 2.8118
openai-gpt-4.1                         | Minimalist | 26.0/72 | 36.1%    | 26   | 46   | 3.7         | 1.7046
anthropic-claude-sonnet-4-20250514     | Minimalist | 26.0/72 | 36.1%    | 26   | 46   | 4.4         | 13.8970
anthropic-claude-opus-4-20250514       | Minimalist | 26.0/72 | 36.1%    | 26   | 46   | 6.5         | 69.4046
openai-gpt-3.5-turbo                   | Minimalist | 21.0/82 | 25.6%    | 21   | 51   | 0.6         | 0.0388
ii-medical-8b                          | Minimalist | 20.0/82 | 24.4%    | 20   | 52   | 23.0        | 0.0000

Generation Benchmark

In the generation benchmark, an LLM is given an unstructured clinical note and must generate a valid FHIR resource. The benchmark evaluates the model's ability to structure clinical data into standardized FHIR format.

Two approaches are compared: zero-shot generation versus multi-shot with tool use, demonstrating the impact of access to the FHIR $validate operation on generation quality.
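
$validate is a standard FHIR REST operation: POSTing a resource to [base]/Bundle/$validate returns an OperationOutcome listing errors, warnings, and informational issues. A minimal client sketch (the base URL is a placeholder, not a framework-specific value):

// POST a generated Bundle to a FHIR server's $validate endpoint.
// Returns the OperationOutcome describing any errors or warnings.
async function validateBundle(baseUrl: string, bundle: object): Promise<any> {
  const response = await fetch(`${baseUrl}/Bundle/$validate`, {
    method: "POST",
    headers: { "Content-Type": "application/fhir+json" },
    body: JSON.stringify(bundle),
  });
  return response.json(); // an OperationOutcome resource
}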

Zero-Shot Generation

A direct instruction to convert clinical notes to FHIR, with no examples and no structured guidance. This tests the model's inherent understanding of FHIR structure.

You are a health informaticist expert in FHIR. 
You will receive unstructured notes and you need to structure them into FHIR resources.
You must only include data that is present in the note.
You must only return a valid FHIR JSON Bundle, with the appropriate resources, with no additional explanation.
You may include multiple resources in the bundle.
You must follow the FHIR R4 specification.
You must not include a meta element in the resources.
When generating a CodeableConcept, you must include a coding element with a system, code, and display.
When generating a CodeableConcept, you must use a display matching what is expected by the CodeSystem.
Each entry in a Bundle must have a fullUrl which is the identity of the resource in the entry.
The id of a resource must be a valid UUID in lowercase.

You must only return JSON with no additional markup or explanation.

<note>
{{note}}
</note>
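
A response satisfying these rules looks roughly like the fragment below. This is an illustrative shape only, not a benchmark reference answer; note the urn:uuid fullUrl on each entry, the lowercase UUID ids, and the fully populated coding element.

{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "fullUrl": "urn:uuid:a9b4c1d2-3e4f-4a5b-8c6d-7e8f9a0b1c2d",
      "resource": {
        "resourceType": "Patient",
        "id": "a9b4c1d2-3e4f-4a5b-8c6d-7e8f9a0b1c2d",
        "name": [{ "family": "Example", "given": ["Pat"] }]
      }
    },
    {
      "fullUrl": "urn:uuid:b1c2d3e4-5f60-4a7b-9c8d-0e1f2a3b4c5d",
      "resource": {
        "resourceType": "Condition",
        "id": "b1c2d3e4-5f60-4a7b-9c8d-0e1f2a3b4c5d",
        "code": {
          "coding": [{
            "system": "http://snomed.info/sct",
            "code": "38341003",
            "display": "Hypertensive disorder, systemic arterial (disorder)"
          }]
        },
        "subject": { "reference": "urn:uuid:a9b4c1d2-3e4f-4a5b-8c6d-7e8f9a0b1c2d" }
      }
    }
  ]
}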

Multi-Shot + Tool Use

Provides multiple examples and access to a FHIR validation function. Models can call the $validate operation up to 10 times to iteratively improve their FHIR Bundle before submitting the final result.

Validation Process (sketched in code after the list):

  1. Model generates initial FHIR Bundle from clinical note
  2. Calls validate_fhir_bundle() function
  3. Receives validation errors and warnings from FHIR server
  4. Iteratively fixes issues and re-validates (up to 10 attempts)
  5. Returns final validated Bundle
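
In code, the loop looks roughly like the sketch below. generateBundle (which calls the model with the note and any prior validation feedback) and FHIR_BASE_URL are hypothetical stand-ins, and validateBundle is the $validate client sketched earlier; none of these names are confirmed from the framework's source.

// Hypothetical helpers, declared here so the sketch is self-contained.
declare function generateBundle(note: string, feedback?: string): Promise<object>;
declare function validateBundle(baseUrl: string, bundle: object): Promise<any>;
declare const FHIR_BASE_URL: string;

// Generate a Bundle, then validate and repair it, up to 10 attempts.
async function generateWithValidation(note: string): Promise<object> {
  let feedback: string | undefined;
  let bundle: object = {};
  for (let attempt = 0; attempt < 10; attempt++) {
    bundle = await generateBundle(note, feedback);
    const outcome = await validateBundle(FHIR_BASE_URL, bundle);
    // OperationOutcome.issue holds the validator's findings.
    const errors = (outcome.issue ?? []).filter(
      (i: { severity: string }) => i.severity === "error" || i.severity === "fatal"
    );
    if (errors.length === 0) return bundle; // validated: submit this Bundle
    feedback = JSON.stringify(errors);      // re-prompt with the errors
  }
  return bundle; // best effort after 10 attempts
}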

Results Summary

Pass@1 by approach. Δ is the relative improvement from adding multi-shot examples and tool use; a dash marks configurations that were not evaluated.

Model                 | Zero-Shot | Multi-Shot + Tool Use | Δ
claude-sonnet-4       | 50.0%     | 100.0%                | +100.0%
claude-opus-4         | 42.9%     | 85.7%                 | +100.0%
gpt-4.1               | 42.9%     | 78.6%                 | +83.3%
o3-high               | 50.0%     | 78.6%                 | +57.1%
claude-3.5-sonnet     | 35.7%     | 78.6%                 | +120.0%
o3-low                | 42.9%     | 57.1%                 | +33.3%
claude-3.5-haiku      | 28.6%     | 50.0%                 | +75.0%
gpt-3.5-turbo         | 14.3%     | 5.6%                  | -61.1%
gemini-2.0-flash      | 14.3%     | -                     | -
gemini-2.5-flash      | 28.6%     | -                     | -
gemini-2.5-pro        | 50.0%     | -                     | -
ii-medical-8b         | 0.0%      | -                     | -
medgemma-4b-it        | 0.0%      | -                     | -
medgemma-27b-text-it  | 0.0%      | -                     | -

Complete Results (Pass@1)

Pass@1 accuracy for every evaluated model & generation approach combination.

Model                                  | Prompt    | Score   | Accuracy | Pass | Fail | Latency (s) | Cost ($)
anthropic-claude-sonnet-4-20250514     | Tool Use  | 14.0/14 | 100.0%   | 14   | 0    | 32.4        | 1.3101
anthropic-claude-opus-4-20250514       | Tool Use  | 13.6/14 | 85.7%    | 12   | 2    | 41.6        | 7.7170
openai-gpt-4.1                         | Tool Use  | 13.4/14 | 78.6%    | 11   | 3    | 26.9        | 0.2678
openai-o3-high                         | Tool Use  | 13.4/14 | 78.6%    | 11   | 3    | 69.7        | 4.8258
anthropic-claude-3-5-sonnet-20241022   | Tool Use  | 12.8/14 | 78.6%    | 11   | 3    | 39.1        | 0.8450
openai-o3-low                          | Tool Use  | 12.6/14 | 57.1%    | 8    | 6    | 51.2        | 7.0484
anthropic-claude-3-5-haiku-20241022    | Tool Use  | 10.0/14 | 50.0%    | 7    | 7    | 55.0        | 0.4672
openai-o3-high                         | Zero Shot | 12.2/14 | 50.0%    | 7    | 7    | 35.8        | 1.4541
anthropic-claude-sonnet-4-20250514     | Zero Shot | 12.4/14 | 50.0%    | 7    | 7    | 9.6         | 0.2075
google-gemini-2.5-pro-preview-05-06    | Zero Shot | 12.4/14 | 50.0%    | 7    | 7    | 35.0        | 0.5771
openai-gpt-4.1                         | Zero Shot | 12.2/14 | 42.9%    | 6    | 8    | 8.8         | 0.0757
openai-o3-low                          | Zero Shot | 11.6/14 | 42.9%    | 6    | 8    | 14.6        | 0.6870
anthropic-claude-opus-4-20250514       | Zero Shot | 12.4/14 | 42.9%    | 6    | 8    | 14.4        | 1.0696
anthropic-claude-3-5-sonnet-20241022   | Zero Shot | 12.0/14 | 35.7%    | 5    | 9    | 9.0         | 0.1568
anthropic-claude-3-5-haiku-20241022    | Zero Shot | 10.2/14 | 28.6%    | 4    | 10   | 7.9         | 0.0450
google-gemini-2.5-flash-preview-05-20  | Zero Shot | 11.6/14 | 28.6%    | 4    | 10   | 11.3        | 0.0242
openai-gpt-3.5-turbo                   | Zero Shot | 9.6/14  | 14.3%    | 2    | 12   | 3.8         | 0.0110
google-gemini-2.0-flash                | Zero Shot | 10.4/14 | 14.3%    | 2    | 12   | 4.2         | 0.0056
openai-gpt-3.5-turbo                   | Tool Use  | 7.0/18  | 5.6%     | 1    | 13   | 32.3        | 0.0547
ii-medical-8b                          | Zero Shot | 2.8/14  | 0.0%     | 0    | 14   | 65.6        | 0.0000
medgemma-4b-it                         | Zero Shot | 6.8/14  | 0.0%     | 0    | 14   | 22.5        | 0.0000
medgemma-27b-text-it                   | Zero Shot | 9.0/14  | 0.0%     | 0    | 14   | 36.2        | 0.0000

Contribute to LLM FHIR Eval

Help us expand tasks, improve evaluation quality, and push the boundaries of healthcare LLM research. Every contribution counts!

Get Involved on GitHub