SAN FRANCISCO, CA – June 2, 2025 – Flexpa, the leading healthcare data platform, today announced the full release of LLM FHIR Eval, the industry's first comprehensive open-source benchmark for evaluating Large Language Models (LLMs) on FHIR-specific healthcare tasks.
Following a successful preview release in November 2024, the framework now provides standardized evaluation metrics across 14 leading AI models, including OpenAI's o3, Anthropic's Claude Opus 4, and Google's Gemini 2.5 Pro, establishing new industry standards for assessing LLM capabilities with healthcare interoperability tasks.
Addressing a Critical Gap in Interoperability AI Evaluation
While general AI benchmarks exist, only Flexpa's LLM FHIR Eval provides clear evidence of how tool assistance (function calling) can enhance model capabilities with FHIR, the global standard for healthcare data exchange.
"Healthcare organizations need objective, standardized ways to evaluate AI models before integrating them into interoperability workflows," said Joshua Kelly, CTO and co-founder of Flexpa. "Our benchmark provides the healthcare industry with the rigorous evaluation framework it needs to make informed decisions about AI adoption."
Comprehensive Benchmark Results Available
The complete evaluation results, featuring detailed performance metrics across two core benchmark categories, are now available at flexpa.com/eval. The interactive dashboard showcases:
- Extraction Benchmark: Testing accuracy in answering questions about FHIR resources, with top models saturating the benchmark when using specialist prompting techniques (an illustrative task item follows this list)
- Generation Benchmark: Evaluating models' ability to convert clinical notes into valid FHIR JSON, where tool-assisted approaches showed dramatic improvements over zero-shot generation
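For a concrete sense of the extraction category, here is a minimal, hypothetical task item: a FHIR resource paired with a natural-language question and an expected answer. The field names and schema below are illustrative assumptions, not the benchmark's actual format:

```typescript
// Hypothetical extraction-style eval item: a FHIR resource, a question
// about it, and the expected answer a model's response is scored against.
// This shape is illustrative, not LLM FHIR Eval's actual schema.

interface ExtractionItem {
  resource: Record<string, unknown>; // the FHIR resource under test
  question: string;
  expected: string;
}

const item: ExtractionItem = {
  resource: {
    resourceType: "MedicationRequest",
    status: "active",
    intent: "order",
    medicationCodeableConcept: { text: "Lisinopril 10 mg oral tablet" },
    dosageInstruction: [{ text: "Take one tablet daily" }],
  },
  question: "What medication has been prescribed, and at what dose?",
  expected: "Lisinopril 10 mg, one tablet daily",
};
```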
Key findings reveal that prompt engineering and iterative validation significantly impact model performance on healthcare-specific tasks, with some smaller, specialized models outperforming larger general-purpose ones.
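The iterative-validation pattern behind the tool-assisted generation results can be sketched as a generate-validate-repair loop: the model emits FHIR JSON, a validator reports errors, and those errors are fed back as context for the next attempt. The snippet below is a minimal sketch under that assumption, with `generate` and `validateFhir` as hypothetical stand-ins for an LLM call and a FHIR validator; it is not the benchmark's actual implementation:

```typescript
// Sketch of tool-assisted FHIR generation with iterative validation.
// `generate` and `validateFhir` are hypothetical stand-ins: `generate`
// wraps an LLM call, `validateFhir` wraps a FHIR validator and returns
// a list of issues found in the candidate resource.

type ValidationIssue = { severity: string; diagnostics: string };

async function generateValidFhir(
  clinicalNote: string,
  generate: (prompt: string) => Promise<string>,
  validateFhir: (resource: unknown) => Promise<ValidationIssue[]>,
  maxAttempts = 3,
): Promise<unknown> {
  let prompt = `Convert this clinical note to a FHIR R4 resource as JSON:\n${clinicalNote}`;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await generate(prompt);
    let resource: unknown;
    try {
      resource = JSON.parse(raw);
    } catch {
      // Output was not parseable JSON; ask the model to correct itself.
      prompt += "\nYour previous output was not valid JSON. Return only JSON.";
      continue;
    }
    const errors = (await validateFhir(resource)).filter(
      (i) => i.severity === "error" || i.severity === "fatal",
    );
    if (errors.length === 0) return resource; // passes validation
    // Feed validator errors back to the model and retry.
    prompt += `\nThe validator reported:\n${errors
      .map((i) => `- ${i.diagnostics}`)
      .join("\n")}\nFix these issues and return corrected JSON.`;
  }
  throw new Error("Could not produce valid FHIR within the attempt budget");
}
```

In this framing, the "tool" is the validator itself: its error output becomes additional context for the next generation attempt, which is one plausible reading of why tool-assisted approaches outperform zero-shot generation.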
Industry Impact and Next Steps
The release comes as healthcare organizations increasingly seek to integrate AI into interoperability workflows while maintaining safety and compliance standards. The benchmark provides a standardized approach to model evaluation that healthcare organizations, AI developers, and researchers can use to validate AI systems before deployment.
Flexpa CTO Joshua Kelly will present the benchmark findings and methodology this week at FHIR DevDays 2025, the premier FHIR developer conference, highlighting the framework's potential to accelerate responsible AI adoption in healthcare.
Flexpa is actively collaborating with healthcare informaticists, AI researchers, and FHIR developers to expand the benchmark suite with additional tasks and evaluation methodologies.
About the Evaluation Results
Complete benchmark results, interactive performance comparisons, and detailed methodology are available at flexpa.com/eval. The evaluation framework and source code are available on GitHub under an open-source license.
About Flexpa
Flexpa is the leader in patient-consented claims data from every health plan. Founded in 2021, Flexpa builds a single, secure integration to connect to 300+ health plans, giving you instant access to identified claims data.