Director of the Generative AI Research Program, Division of Data-Driven and Digital Medicine (D3M) at Mount Sinai
I am thrilled to share our recent publication in NEJM AI, which explores the use of large language models (LLMs) like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b in medical coding. This collaborative study benchmarks LLM performance in generating accurate medical billing codes and highlights both the potential and current limitations of AI in healthcare. While GPT-4 showed the most promising results, it is clear that further model fine-tuning, advanced techniques like Retrieval-Augmented Generation, and robust regulatory frameworks are needed to safely integrate AI technologies into healthcare administrative pipelines. I invite you to read our full study and join the conversation: How can we further refine AI applications in healthcare to ensure better patient outcomes and operational efficiency? Robbie Freeman Ali Soroush, MD, MS Ben Glicksberg Alexander Charney Eyal Zimlichman, MD Yiftach Barash Girish Nadkarni 🔗 Link to the full study https://lnkd.in/e2VWsyun
Chief, Division of Data Driven and Digital Medicine (D3M) and Director, Charles Bronfman Institute of Personalized Medicine at the Mount Sinai Health System | AI | Healthcare | Data Science | Digital Health
Utilizing #genai for medical coding is considered low-hanging fruit. However, it is crucial to assess the capabilities and limitations of LLMs like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b in medical coding tasks. We performed a comprehensive benchmarking analysis of 'out of the box' LLMs for medical coding.

Methods 📜
We extracted 12 months of unique ICD and CPT codes from a large health system. We provided each LLM with a code description and a prompt to generate the corresponding billing code, then calculated similarity metrics between the generated and true codes.

🔍 Main Findings:
- Performance: GPT-4 outperformed the other models, with the highest exact match rates across ICD-9-CM, ICD-10-CM, and CPT codes. However, even the best results were under 50%, highlighting a significant accuracy gap.
- Error Analysis: LLMs frequently generated codes that were either imprecise or completely fabricated, raising concerns about their current utility in clinical settings.
- Factors Influencing Performance: Shorter codes and descriptions that appear more frequently in electronic health records generally correlated with better performance.

🚀 Future Directions: To harness AI's full potential in healthcare, further research must focus on:
- Model Training and Fine-tuning: Tailoring LLMs to better understand and generate medical codes through advanced training methods.
- Hybrid AI-Coder Systems: Developing systems that combine AI's computational power with human expertise to enhance accuracy and reliability.
- Regulatory Frameworks: Establishing robust guidelines to ensure the safe integration of AI technologies into medical documentation processes.

By addressing these challenges, we can pave the way for more reliable and efficient medical coding solutions, ultimately improving patient care and operational efficiency.

🔗 Link to the full study https://lnkd.in/e2VWsyun
Let's discuss how we can turn these insights into actionable solutions.
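The evaluation loop described above (give the model a code description, collect the generated code, score it against the true code) can be sketched roughly as follows. This is an illustrative sketch, not the study's actual pipeline: the exact-match and character-similarity metrics here are simple stand-ins for the paper's similarity metrics, and the ICD-10-CM code pairs are invented examples of the error types mentioned (imprecise and near-miss codes).

```python
# Illustrative sketch only (not the study's pipeline): score
# LLM-generated billing codes against ground-truth codes using
# exact match and a character-level similarity ratio.
from difflib import SequenceMatcher

def evaluate_codes(pairs):
    """pairs: list of (true_code, generated_code) tuples."""
    exact = sum(1 for true, gen in pairs if true == gen)
    sims = [SequenceMatcher(None, true, gen).ratio() for true, gen in pairs]
    return {
        "exact_match_rate": exact / len(pairs),
        "mean_similarity": sum(sims) / len(sims),
    }

# Hypothetical examples of model behavior (true code vs. model output)
pairs = [
    ("E11.9", "E11.9"),  # exact match
    ("I10",   "I10.0"),  # imprecise: spurious extra specificity
    ("K21.9", "K22.9"),  # near miss: plausible-looking but wrong code
]
results = evaluate_codes(pairs)
```

An exact-match rate of 1/3 here mirrors the paper's headline point: a model can produce codes that look superficially close (high character similarity) while still being wrong for billing purposes, which is why exact match is the metric that matters.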
#HealthTech #ArtificialIntelligence #MedicalCoding #DigitalHealth Eyal Klang Robbie Freeman Ali Soroush, MD, MS Ben Glicksberg Alexander Charney
So proud of the work we’re doing together and thankful to be part of this amazing group of colleagues leading the way on healthcare AI! Eyal Klang Girish Nadkarni
It sounds like an insightful study, but has anyone considered why these LLMs, like ChatGPT and others, were poor CPT coders? It's because these models were never trained on ICD and CPT codes. The datasets these models are trained on consist of public data, and these ICD and CPT codes are not part of publicly available data. If these models are not trained on, for example, cat images, how can you expect them to identify cat images with sophisticated prompt engineering? In my opinion, you are probably testing the 'guessing' power of these LLM models.
Eyal Klang I think for more complex but specialised tasks like medical coding, agentic workflows are the way forward, where teams of agents iterate and refine outputs. Each agent in the flow can also use a fine-tuned model. I also think the way you construct the prompt template/flow is important.
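The iterate-and-refine loop this comment proposes can be sketched as a propose/critique cycle. Everything here is a hypothetical illustration: `propose_code` and `critique` are stubs standing in for calls to (possibly fine-tuned) LLM agents, wired so the control flow is runnable.

```python
# Hedged sketch of an agentic coding workflow: a coder agent
# proposes a billing code, a reviewer agent critiques it, and the
# feedback is fed back into the next proposal round.

def propose_code(description, feedback=None):
    # Stub: a real system would prompt a coding-specialised model,
    # including the reviewer's feedback in the prompt on later rounds.
    return "I10.0" if feedback is None else "I10"

def critique(description, code):
    # Stub: a real reviewer agent would check validity and specificity.
    # Returns None to accept, or a feedback string to request revision.
    return None if code == "I10" else "Code is over-specified."

def agentic_coding(description, max_rounds=3):
    feedback = None
    code = None
    for _ in range(max_rounds):
        code = propose_code(description, feedback)
        feedback = critique(description, code)
        if feedback is None:  # reviewer accepts
            return code
    return code  # best effort after max_rounds

final = agentic_coding("Essential (primary) hypertension")
```

The design choice worth noting is bounding the loop with `max_rounds` so disagreeing agents cannot cycle forever; a production system would also log each round for a human coder to audit.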
A healthcare system's first goal should be to find a solution that detects the disease a patient is suffering from within 30 days, not after 12 months and visits to 19 experts.
I believe a breakthrough is coming in ~4 months.
Eyal, you are awesome 👌
So cool.
Engineering Leader, Digital Transformation, Product Engineering and Management, Artificial Intelligence, Gen AI, Web3, Research & Development
Girish Nadkarni Eyal Klang This is a valuable contribution to the discussion on the potential and limitations of LLMs in medical coding. Tailoring models for medical language and integrating human expertise are essential steps. I am keenly interested in the following areas. Any plans/roadmaps on the below?
- Exploring how LLMs arrive at their code suggestions. Understanding their reasoning could improve trust and identify potential biases.
- LLM performance in real-time coding scenarios would be insightful.
- Research on seamless integration of LLMs with existing Electronic Health Records (EHR) systems is crucial for practical implementation.