🎉 New Pariksha alert! 🎊
I am so proud to share our latest work, Health Pariksha, an extensive assessment of 24 LLMs on data collected from Indian patients interacting with a medical chatbot in Indian English and four Indic languages. This work was done in collaboration with Varun Gumma, Mohit Jain, Ananditha Raghunath, and Karya (human annotation).
Highlights of our work:
- Multilingual Evaluation: The study evaluates LLM responses to 750 questions posed by patients to a medical chatbot, covering five languages: Indian English, Hindi, Kannada, Tamil, and Telugu. Our dataset is unique in containing code-mixed queries such as “Agar operation ke baad pain ho raha hai, to kya karna hai?” (“If there is pain after the operation, what should I do?”) and “Can I eat before the kanna operation”, as well as culturally relevant queries such as “Can I eat chapati/puri/non veg after surgery?”.
- Responses validated by doctors: We used doctor-validated responses as the ground truth for evaluating model outputs.
- Uniform RAG Framework: All models were assessed using the same Retrieval Augmented Generation (RAG) framework, ensuring a consistent and fair comparison (a minimal sketch of such a harness follows this list).
- Uncontaminated Dataset: Because the queries were newly collected from patient interactions, the dataset is absent from the training data of the evaluated models, providing a reliable basis for assessment.
- Specialized Metrics: The evaluation was based on four metrics chosen in consultation with domain experts and doctors: factual correctness, semantic similarity, coherence, and conciseness, plus a combined overall metric (the second sketch after this list illustrates one way to aggregate them). Both automated techniques and human evaluators were employed to ensure a comprehensive assessment.
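To make the uniform-RAG point concrete, here is a minimal, hypothetical sketch (not our actual code; the retriever, prompt template, and function names are illustrative) of a harness where every model receives the same retrieved context and the same prompt, so score differences reflect the model rather than the pipeline:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Query:
    text: str      # patient question, possibly code-mixed
    language: str  # e.g. "en-IN", "hi", "kn", "ta", "te"

def retrieve(query: Query, k: int = 3) -> list[str]:
    # Placeholder retriever: the real pipeline would return top-k passages
    # from a shared medical knowledge index; a fixed snippet keeps the
    # sketch runnable.
    return ["Mild pain in the first days after surgery is common; take the prescribed medication."][:k]

PROMPT = (
    "You are a medical assistant. Using only the context below, answer the "
    "patient's question in the same language it was asked.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def run_model(generate: Callable[[str], str], queries: list[Query]) -> list[str]:
    # `generate` wraps one model's API; every model sees identical prompts.
    responses = []
    for q in queries:
        context = "\n".join(retrieve(q))
        responses.append(generate(PROMPT.format(context=context, question=q.text)))
    return responses
```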
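And a similarly hypothetical sketch of the scoring step: the four metrics are the ones named above, but the 0-1 scale and the unweighted average used for the overall score here are illustrative assumptions, not the paper's exact rubric:

```python
from statistics import mean

# The four per-response metrics; scores are assumed normalized to [0, 1].
METRICS = ("factual_correctness", "semantic_similarity", "coherence", "conciseness")

def overall_score(scores: dict[str, float]) -> float:
    # Assumed aggregation: an unweighted mean over the four metrics.
    missing = set(METRICS) - scores.keys()
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return mean(scores[m] for m in METRICS)

# Example rating for one response, from a human annotator or an LLM judge:
print(overall_score({
    "factual_correctness": 1.0,
    "semantic_similarity": 0.8,
    "coherence": 0.9,
    "conciseness": 0.7,
}))  # 0.85
```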
Key Findings:
- Performance Variability: The study finds significant performance variability among models, with some smaller models outperforming larger ones.
- Language-Specific Performance: Indic models do not consistently perform well on Indic-language queries, and factual correctness is generally lower for non-English queries. This shows there is still work to be done to build models that can answer questions reliably in Indian languages.
- Locally grounded, non-translated datasets: Our dataset includes many instances of code-switching, Indian English colloquialisms, and culturally specific questions that cannot be obtained by translating existing datasets, especially automatically. While models handled code-switching to a certain extent, their responses to culturally relevant questions varied greatly. This underscores the importance of collecting datasets from target populations when building solutions.
Check out the rest of the leaderboards in our paper (link in comments).