Engineer & Data Scientist | Building AI/Data-First Products & Solutions | Problems First | at the Crossroads of Systems, Behavior, and Intelligence | Learning over Knowing
🚀 I recently dived into Anoop Kunchukuttan’s survey presentation on extending Large Language Models (LLMs) to other languages (AI4Bhārat). It’s nothing short of a playbook for building customized LLMs. 🌐🛠️

🔍 Highlights:
• A crisp end-to-end overview of extending the power of English-based LLMs to Indic and other languages. 🌍
• Technical Breakdown: Clear introductions to canonical LLMs and design patterns, essential for both beginners and experts. 🧩
• In-depth Process Guide: From extending vocabularies 📚 to embedding new words and initialisation methodologies 🧬, the survey meticulously outlines each step.
• Practical Insights: Details on pretraining corpora, continual pretraining, instruction fine-tuning, and the importance of dataset preparation by augmenting or distilling from existing models and translations. 💡
  - Covers data-mixing strategies such as romanized text, code-switching, and cross-lingual datasets.
• Cross-Lingual Innovation: Discusses cross-lingual prompting methods and instruction alignment. 🤖⚙️
• Empirical Evidence: Ablation studies back the claims with concrete data, making it easy to see what worked better at each step. 📊

This survey is not just an academic read but a solid reference for anyone looking to adapt LLMs to specific domains or languages. It underscores the critical steps in making AI accessible and functional across linguistic barriers.

A massive shoutout to Anoop Kunchukuttan!

#AI #LanguageModels #MachineLearning #InclusiveAI #AI4Bharat #generativeai #deeplearning
Can you share the link?
Excited to dive into this! Thanks for sharing your insights, Prabakaran Chandran.
Researcher-Microsoft, Co-founder and Co-lead-AI4Bharat. I work on Machine Translation, Multilingual Learning and Indian Language NLP
Thanks, happy to see you found it useful!