Director of the Generative AI Research Program, Division of Data-Driven and Digital Medicine (D3M) at Mount Sinai
I am thrilled to share our recent publication in NEJM AI, which explores the use of large language models (LLMs) like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b in medical coding. This collaborative study benchmarks LLM performance in generating accurate medical billing codes and highlights both the potential and current limitations of AI in healthcare. While GPT-4 showed the most promising results, it is clear that further model fine-tuning, advanced techniques like Retrieval-Augmented Generation, and robust regulatory frameworks are needed to safely integrate AI technologies into healthcare administrative pipelines. I invite you to read our full study and join the conversation: How can we further refine AI applications in healthcare to ensure better patient outcomes and operational efficiency? Robbie Freeman Ali Soroush, MD, MS Ben Glicksberg Alexander Charney Eyal Zimlichman, MD Yiftach Barash Girish Nadkarni 🔗 Link to the full study https://lnkd.in/e2VWsyun
Chief, Division of Data Driven and Digital Medicine (D3M) and Director, Charles Bronfman Institute of Personalized Medicine at the Mount Sinai Health System | AI | Healthcare | Data Science | Digital Health
Utilizing #genai for medical coding is considered low-hanging fruit. However, it is crucial to assess the capabilities and limitations of LLMs like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b in medical coding tasks. We performed a comprehensive benchmarking analysis of 'out of the box' LLMs for medical coding.

Methods 📜
We extracted 12 months of unique ICD and CPT codes from a large health system. We provided each LLM with a code description and a prompt to generate the corresponding billing code, then calculated similarity metrics between the generated and true codes.

🔍 Main Findings:
- Performance: GPT-4 outperformed the other models, with the highest exact match rates across ICD-9-CM, ICD-10-CM, and CPT codes. However, even the best results were under 50%, highlighting a significant accuracy gap.
- Error Analysis: LLMs frequently generated codes that were either imprecise or completely fabricated, raising concerns about their current utility in clinical settings.
- Factors Influencing Performance: Shorter codes and descriptions that appear more frequently in electronic health records generally correlated with better performance.

🚀 Future Directions: To harness AI's full potential in healthcare, further research must focus on:
- Model Training and Fine-tuning: Tailoring LLMs to better understand and generate medical codes through advanced training methods.
- Hybrid AI-Coder Systems: Developing systems that combine AI's computational power with human expertise to enhance accuracy and reliability.
- Regulatory Frameworks: Establishing robust guidelines to ensure the safe integration of AI technologies into medical documentation processes.

By addressing these challenges, we can pave the way for more reliable and efficient medical coding solutions, ultimately improving patient care and operational efficiency.

🔗 Link to the full study https://lnkd.in/e2VWsyun
Let's discuss how we can turn these insights into actionable solutions.
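The evaluation loop described above (give the model a code description, collect the generated code, score it against the true code) can be sketched roughly as follows. This is an illustrative sketch, not the study's actual pipeline: the exact-match and character-similarity metrics here are simple stand-ins for the paper's similarity metrics, and the ICD-10-CM code pairs are invented examples of the error types mentioned (imprecise and near-miss codes).

```python
# Illustrative sketch only (not the study's pipeline): score
# LLM-generated billing codes against ground-truth codes using
# exact match and a character-level similarity ratio.
from difflib import SequenceMatcher

def evaluate_codes(pairs):
    """pairs: list of (true_code, generated_code) tuples."""
    exact = sum(1 for true, gen in pairs if true == gen)
    sims = [SequenceMatcher(None, true, gen).ratio() for true, gen in pairs]
    return {
        "exact_match_rate": exact / len(pairs),
        "mean_similarity": sum(sims) / len(sims),
    }

# Hypothetical examples of model behavior (true code vs. model output)
pairs = [
    ("E11.9", "E11.9"),  # exact match
    ("I10",   "I10.0"),  # imprecise: spurious extra specificity
    ("K21.9", "K22.9"),  # near miss: plausible-looking but wrong code
]
results = evaluate_codes(pairs)
```

An exact-match rate of 1/3 here mirrors the paper's headline point: a model can produce codes that look superficially close (high character similarity) while still being wrong for billing purposes, which is why exact match is the metric that matters.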
#HealthTech #ArtificialIntelligence #MedicalCoding #DigitalHealth Eyal Klang Robbie Freeman Ali Soroush, MD, MS Ben Glicksberg Alexander Charney
So proud of the work we’re doing together and thankful to be part of this amazing group of colleagues leading the way on healthcare AI! Eyal Klang Girish Nadkarni
It sounds like an insightful study, but has anyone considered why these LLMs, like ChatGPT and others, were poor CPT coders? It's because these models were never trained on ICD and CPT codes. The datasets these models are trained on consist of public data, and these ICD and CPT codes are not part of publicly available data. If these models are not trained on, for example, cat images, how can you expect them to identify cat images with sophisticated prompt engineering? In my opinion, you are probably testing the 'guessing' power of these LLM models.
Eyal Klang I think for more complex but specialised tasks like medical coding, agentic workflows are the way forward, where teams of agents iterate and refine outputs. Each agent in the flow can also use a fine-tuned model. I also think the way you construct the prompt template/flow is important.
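The iterate-and-refine loop this comment proposes can be sketched as a propose/critique cycle. Everything here is a hypothetical illustration: `propose_code` and `critique` are stubs standing in for calls to (possibly fine-tuned) LLM agents, wired so the control flow is runnable.

```python
# Hedged sketch of an agentic coding workflow: a coder agent
# proposes a billing code, a reviewer agent critiques it, and the
# feedback is fed back into the next proposal round.

def propose_code(description, feedback=None):
    # Stub: a real system would prompt a coding-specialised model,
    # including the reviewer's feedback in the prompt on later rounds.
    return "I10.0" if feedback is None else "I10"

def critique(description, code):
    # Stub: a real reviewer agent would check validity and specificity.
    # Returns None to accept, or a feedback string to request revision.
    return None if code == "I10" else "Code is over-specified."

def agentic_coding(description, max_rounds=3):
    feedback = None
    code = None
    for _ in range(max_rounds):
        code = propose_code(description, feedback)
        feedback = critique(description, code)
        if feedback is None:  # reviewer accepts
            return code
    return code  # best effort after max_rounds

final = agentic_coding("Essential (primary) hypertension")
```

The design choice worth noting is bounding the loop with `max_rounds` so disagreeing agents cannot cycle forever; a production system would also log each round for a human coder to audit.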
A healthcare system's first goal should be to find a solution that detects the disease a patient is suffering from within 30 days, not after 12 months and visits to 19 experts.
I believe a breakthrough is coming in ~4 months.
Eyal, you are awesome 👌
So cool.
Engineering Leader, Digital Transformation, Product Engineering and Management, Artificial Intelligence, Gen AI, Web3, Research & Development
Girish Nadkarni Eyal Klang This is a valuable contribution to the discussion on the potential and limitations of LLMs in medical coding. Tailoring models for medical language and integrating human expertise are essential steps. I am keenly interested in the following areas. Any plans/roadmaps on the below?
- Exploring how LLMs arrive at their code suggestions. Understanding their reasoning could improve trust and identify potential biases.
- LLM performance in real-time coding scenarios would be insightful.
- Research on seamless integration of LLMs with existing Electronic Health Records (EHR) systems is crucial for practical implementation.