🚀 1 Billion Records in 22 Languages: The ETL Project for India Voter Data

We converted 1 billion voter records, scattered across Indian electoral authorities in all administrative divisions and states, into a structured database. The data came in many forms, including PDFs and photos of handwritten forms, and in 22 different languages. We cross-referenced names and addresses with data from India Post. We also converted non-English data into Roman characters, thanks to the expertise of linguists and custom machine transliteration algorithms.

Read the full case study here (sample data included): https://lnkd.in/ekFfiNz8

#DataExtraction #WebScraping #Nannostomus #DataScience #BigData #ETL
Nannostomus’ Post
More Relevant Posts
-
Shubhradeep Nandi, Sr. Data Scientist at the Government of Andhra Pradesh, showcased pioneering research on taxpayer risk assessment at #MLDS2024. By 2020, India had already lost 20,000 crores due to GST taxes, Nandi recalled at the start of his presentation. By integrating taxpayer data into natural-language profiles and fine-tuning Large Language Models (LLMs) on them, his approach outperforms traditional methods and offers nuanced insights for informed decision-making. This use of LLMs improves accuracy and deepens understanding of taxpayer behavior, which is crucial for modernizing governmental financial departments.
-
Associate Director at Standard Chartered Bank | Data Scientist | Helping people to break into the Data Industry | IIM-C | IIIT-B | ex- HSBC
Major Discrepancies in Exit Polls vs. Election Results: A Data Scientist's Insight. Yesterday, the Indian Lok Sabha 2024 election results were announced, revealing significant discrepancies between the exit polls and the actual outcomes. In my latest post, I explore the biases and other contributing factors leading to such discrepancies. Additionally, I propose advanced strategies for improving exit poll accuracy, including dynamic sampling, machine learning integration, enhanced data validation, and Bayesian inference. P.S: This post is purely statistical. There can be political reasons for the discrepancies too, but analyzing those is like mixing chai with coffee - best left to the experts in a different brew! #IndianElections #LokSabha2024 #ExitPolls #DataScience #Analytics #ElectionAnalytics #BiasInPolling
-
Building the best text embeddings simply using synthetic data & LLM. https://lnkd.in/dmr5yvbp
-
SQLCoder-2–7b: How to Reliably Query Data in Natural Language, on Consumer Hardware Nice piece on this custom 7b model by our Sjoerd Tiemensma https://lnkd.in/dfzBnpGT
useai.substack.com
-
This week, we completed a really intense data story. We parsed polling booth data in Chennai's Lok Sabha constituencies, matched it with property guideline value data (as a proxy for income), and tested the hypothesis that "class-based voting" is a thing in urban India, especially in Chennai.

Turns out that the correlation between class and party choice was quite high in Chennai. The urban poor and less well-off sections voted a lot for the Dravida Munnetra Kazhagam (and its allies), and the party also did well across segments, while the richer and more well-off sections of Chennai voted a lot for the BJP, especially in affluent areas.

Polling booth data was promptly made available by Tamil Nadu's Chief Electoral Officer, albeit as scanned PDFs, which made it a bit difficult to parse and convert into simple CSV. Property price/guideline value data is available from the Tamil Nadu Government's registration department, and we had to scrape the full information. The names of streets/associations on the department website were in Tamil, and the Google Translate API did a great job, but it wasn't exactly a perfect match.

Still, Vignesh Radhakrishan, Sambavi Parthasarathy, a couple of interns and I managed to do what we set out to do using Gen AI tools, good ol' plain spreadsheet work, intensive cleaning up, record matching and so on. The output is attached. It was enlivened by a wonderful illustration by our colleague, Soumyadip Sinha.

The internet versions of the article are here: https://lnkd.in/gJgfgz6a https://lnkd.in/gFFTE9YC
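For readers curious how translated street names that are "not exactly a perfect match" can still be paired up, here is a minimal sketch using Python's standard `difflib`. It is an illustration only, not the method the team used: the function name `match_street`, the 0.8 cutoff, and the sample street names are all assumptions made for the example.

```python
from difflib import get_close_matches

def match_street(translated_name, booth_streets, cutoff=0.8):
    """Return the closest booth-side street name, or None if nothing is close enough."""
    # Normalise case and whitespace so trivial differences do not block a match.
    normalised = {" ".join(s.lower().split()): s for s in booth_streets}
    key = " ".join(translated_name.lower().split())
    # get_close_matches ranks candidates by difflib's similarity ratio.
    hits = get_close_matches(key, list(normalised), n=1, cutoff=cutoff)
    return normalised[hits[0]] if hits else None

# Hypothetical sample data, not taken from the actual datasets.
booth_streets = ["Anna Salai", "Kamarajar Salai", "Cathedral Road"]
print(match_street("Anna Saalai", booth_streets))       # a near-miss spelling
print(match_street("Unrelated Avenue", booth_streets))  # nothing similar
```

A stricter cutoff reduces false pairings at the cost of more names left for manual review, which fits the mix of automated matching and spreadsheet clean-up described above.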
-
I created a visualization of the number of Communal Riots (CR) registered per state in India from 2016 to 2020. The data was taken from the National Crime Records Bureau report, which you can download from https://data.gov.in/ There were huge discrepancies in the data. For instance, in Uttar Pradesh there were no CR mentions after 2018, which is obviously not true. I believe the data is not an actual representation of reality, but it can be used by computational social scientists to study causation and correlation in riot-prone areas and help mitigate riots. Also, there is no state data available after the year 2022. #data #tableau #india Sinan Aral Prof Sandra Wachter
-
LLM for Romanian

The Institute for Logic and Data Science (ILDS: https://ilds.ro/) launches the project LLM for Romanian: pre-training and fine-tuning of Large Language Models to obtain a foundation model for the Romanian language.

Timeframe: February 2024 – February 2025

Partners:
- BRD - Groupe Societe Generale
- Applied Data Science Center, University of Bucharest
- National University of Science and Technology POLITEHNICA Bucharest

Description excerpts: This project is part of a larger project that aims to build a Large Language Model (LLM) for the Romanian language that can be adapted to a wide range of domains and use cases (i.e., a foundation model). As the approach for adapting to a specific domain, the larger project will focus on Retrieval-Augmented Generation (RAG), which combines information retrieval with text generation and helps provide more accurate and contextually relevant responses. As a use case, the focus will be on question-answering chat assistants. All of these need an LLM with good capabilities.

Useful links:
- Project page on ILDS site: https://lnkd.in/dGZDXJNS
- The technical report may be found here: https://lnkd.in/dbqrP2JP
- The model may be downloaded here: https://lnkd.in/d5Q85BZA
- The underlying code may be downloaded here: https://lnkd.in/ddEcGD-a
-
Data Structures & Algorithms are a source of income in India rather than tools to solve problems.
-
Dataverse and multiple languages https://lnkd.in/dV2WPFj5 The page above is a great resource for identifying where text and labels are used in Dataverse. I heard an interesting idea: you can add multiple languages to the data itself by having a field for each language. This isn't scalable if you have many languages to support, but it would let you show 2 different languages and search in both.
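To make the field-per-language idea concrete, here is a minimal sketch in plain Python. It does not use the Dataverse API; the `Product` record, its `name_en` and `name_fr` fields, and the sample rows are hypothetical and only illustrate the shape of the data.

```python
from dataclasses import dataclass

@dataclass
class Product:
    # Hypothetical record with one name field per supported language.
    name_en: str
    name_fr: str

def search(records, term):
    """Return records whose name contains the term in either language."""
    needle = term.casefold()
    return [
        r for r in records
        if needle in r.name_en.casefold() or needle in r.name_fr.casefold()
    ]

# Hypothetical sample rows.
records = [Product("Apple", "Pomme"), Product("Pineapple", "Ananas")]
print(search(records, "pomme"))
print(search(records, "apple"))
```

Every extra language means another field and another clause in the search, which is why the approach only suits a small, fixed set of languages.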
-
This is an example of the type of application I have been waiting for: one that uses a verified database. https://wapo.st/3I76x7y
Opinion | An ‘education legend’ has created an AI that will change your mind about AI
washingtonpost.com