LLMs #2

Hey all, welcome back to the second episode of the Cup of Coffee Series with LLMs. Once again, we have Mr. Bean with us.

Are you here for the first time? Check out my previous articles, where I discussed the introduction to LLMs and the transformer architecture.

Woohoo, let's get started.

First, let us discuss the different categories of LLMs.

1. Generative

These models learn patterns in text data and use them to create something new, like writing original content or translating languages (make it new).

Tasks - Text generation, machine translation, writing different kinds of creative content.

Products - Content creation tools like Jasper or Writesonic, machine translation apps like Google Translate, AI-powered image/music generation tools like DALL-E or MuseNet.

2. Discriminative

These models focus on classifying existing text data, like labeling emails as spam or determining whether a review is positive (spot the difference). You can see this in action today in Amazon's AI-generated review summaries.

Tasks - Sentiment analysis (classifying text as positive, negative, or neutral), spam detection (classifying emails as spam or not spam).

Products - Spam filters in email service providers like Gmail, sentiment analysis tools in social media analytics platforms, chatbots with intent recognition.
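
To make the two categories concrete, here is a minimal sketch using the Hugging Face transformers library (my own illustrative choice, not tied to any product above): one pipeline generates new text, the other classifies existing text.

```python
# A minimal sketch contrasting the two categories with Hugging Face
# `transformers` pipelines. The model choices here are illustrative.
from transformers import pipeline

# Generative: continue a prompt with new text ("make it new").
generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=20)[0]["generated_text"])

# Discriminative: label existing text ("spot the difference").
classifier = pipeline("sentiment-analysis")
print(classifier("This coffee series is wonderful!"))
```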

Mr. Bean: What are the steps involved in building an LLM?

Sure, let me explain.

  1. Define Goal & Use Case
  2. Data Collection & Preprocessing
  3. Model Architecture & Design
  4. Train the LLM
  5. Evaluate & Validate
  6. Fine-Tuning (Optional)
  7. Deploy the LLM
  8. Ethical Considerations

These are the steps involved in building LLMs. Let me explain each in detail.

1. Define Goal & Use Case


Not all problems require an LLM. First, clearly define what you want the LLM to achieve.


Goal

What problem are you trying to solve with the LLM? Is it for generating creative text formats, translating languages, or writing different kinds of content?

Use Case

How will the LLM be used in practice? Will it be integrated into a specific application, used for research, or offered as a public service?


Mr. Bean: How do we know which problems require an LLM?


Great question.

Does the problem involve massive amounts of text data? LLMs excel at learning patterns from large datasets.
Does the task require understanding complex relationships within language?
Is the goal to achieve human-level performance on a language task?

If the answer to these questions is yes, an LLM would be a great choice.

For example:

An LLM could be a good choice for summarizing news articles - it can process massive amounts of text and identify key points. (Suitable)

An LLM might not be the best choice for a simple math equation solver - a traditional algorithm might be faster and more efficient. (Not ideal)

Mr. Bean: Can I know where LLMs are not ideal?


LLMs are excellent for creative tasks and working with language, but it's important to be aware of their limitations. For tasks requiring strict reasoning and logic, factual accuracy, real-time responses, common sense and social cues, up-to-date information, or exact data retrieval, other tools might be more suitable.

LLMs can be fooled by biased data or misleading information in their training data. This can lead them to confidently spout incorrect information.

2. Data Acquisition and Preprocessing for LLMs

This involves two processes: data collection and preprocessing. Let us discuss both in detail.

 I. Data Collection         

I. a) General LLMs

Web Crawling

This is a large-scale automated process that harvests data from publicly accessible websites.

Techniques involve identifying relevant URLs, navigating website structures, and extracting text content. Tools like Apache Nutch or Scrapy can be used.
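
For instance, here is a minimal Scrapy spider sketch; the start URL and CSS selectors are hypothetical placeholders that would need to match the real site's structure (and respect its robots.txt rules).

```python
# A minimal sketch of a Scrapy spider for harvesting article text.
# The start URL and CSS selectors are hypothetical placeholders;
# adapt them to a site you are actually permitted to crawl.
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/articles"]  # hypothetical listing page

    def parse(self, response):
        # Extract the visible paragraph text of each article on the page.
        for article in response.css("article"):
            yield {"text": " ".join(article.css("p::text").getall())}
        # Follow the pagination link, if any, to continue the crawl.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as article_spider.py, it can be run with `scrapy runspider article_spider.py -o corpus.jl` to collect the records into a JSON Lines file.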

Public Text Datasets

Pre-existing collections of text data like books (Project Gutenberg), articles (arXiv), or code repositories (GitHub) can be valuable sources.
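
Many of these are one import away. As a minimal sketch, the Hugging Face `datasets` library (my own choice here, not mentioned above) can pull a public corpus such as WikiText-2:

```python
# A minimal sketch of loading a public text corpus with the Hugging Face
# `datasets` library; WikiText-2 is one small, freely available example.
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(dataset[0]["text"][:200])  # peek at the first record
```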

I. b) Specialized LLMs

Domain-specific data is crucial. For example, a medical LLM might utilize scientific publications from PubMed Central, while a legal LLM could leverage legal case documents.

I. c) Other methods for data acquisition include

Data Sources Beyond Crawling:

Data APIs and Marketplaces

These platforms act as libraries, offering pre-processed datasets on various topics. Imagine them as pre-organized data, ready for your LLM training. Examples include datasets hosted on Google Cloud Storage or Amazon S3.
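
As a minimal sketch, fetching a hosted dataset object from Amazon S3 with boto3 might look like this; the bucket and key names are hypothetical, and real datasets publish their own locations and access terms.

```python
# A minimal sketch of downloading a hosted dataset object from Amazon S3
# with boto3. The bucket and key below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-public-datasets",   # hypothetical bucket name
    Key="corpora/news-articles.jsonl",  # hypothetical object key
    Filename="news-articles.jsonl",     # local destination
)
```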

Data Licensing

Sometimes, the most valuable data resides with private organizations (e.g., news agencies, medical journals). Obtaining licenses grants access to this data, but careful consideration of copyright and usage terms is crucial.

Enhancing Data Diversity

Crowdsourcing

Platforms like Amazon Mechanical Turk allow you to outsource data collection tasks (labeling, annotation) to a large, distributed workforce. This can be a cost-effective way to gather diverse data points, especially for specific needs.

Manual Curation

For specialized LLMs, human expertise is irreplaceable. Experts can select high-quality data that's directly relevant to the LLM's purpose, ensuring the model focuses on the most appropriate information.

II. Data Preprocessing

1. Text Cleaning
2. Data Balancing
3. Text Augmentation

1. Text Cleaning

Normalization

Lowercasing: Converting all text to lowercase for consistency (e.g., "Cat" and "CAT" become the same).

Punctuation Removal: Removing punctuation marks (.,?!) as they might not hold meaning for the LLM. Decisions might be made to keep some punctuation (e.g., quotation marks for dialogue).


Tokenization: Breaking down text into smaller units the LLM can understand. This could be words, characters, or even sentences depending on the model's architecture.

Stop Word Removal: Eliminating common words that offer little meaning (e.g., "the," "a," "an") to improve training efficiency and focus on content-rich words.

Spelling Correction: Techniques like dictionary lookup or statistical methods can be used to identify and correct typos or misspellings.

Entity Recognition and Removal: Identifying and removing named entities like people, locations, or organizations, which may be necessary for privacy or security reasons. This step is optional. A minimal sketch combining several of these cleaning steps follows below.
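
Here is that sketch, using NLTK and spaCy; it assumes their resources have already been downloaded (`nltk.download("punkt")`, `nltk.download("stopwords")`, `python -m spacy download en_core_web_sm`).

```python
# A minimal text-cleaning sketch with NLTK and spaCy (resources assumed
# to be downloaded already, as noted above).
import string
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nlp = spacy.load("en_core_web_sm")
STOP_WORDS = set(stopwords.words("english"))

def clean(text: str) -> list[str]:
    # Optional entity removal: mask people, places, organizations.
    doc = nlp(text)
    for ent in reversed(doc.ents):  # reversed so character offsets stay valid
        text = text[:ent.start_char] + "ENTITY" + text[ent.end_char:]
    # Normalization: lowercase, then strip punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenization: break the text into word-level units.
    tokens = word_tokenize(text)
    # Stop word removal: keep only content-rich words.
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("Dr. Smith visited Paris and THE weather was lovely!"))
```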

2. Data Balancing

Real-world data often has inherent biases. Techniques like oversampling (replicating underrepresented data points) or undersampling (removing data from overrepresented classes) can be used to create a more balanced dataset for training.
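
A minimal sketch of naive oversampling on a toy labeled dataset (dedicated libraries such as imbalanced-learn exist, but the core idea is just replication):

```python
# A minimal sketch of naive oversampling on a toy labeled dataset.
import random
from collections import Counter, defaultdict

data = [("great product", "pos"), ("loved it", "pos"),
        ("awful experience", "neg")]  # "neg" is under-represented

by_label = defaultdict(list)
for text, label in data:
    by_label[label].append((text, label))

target = max(len(items) for items in by_label.values())
balanced = []
for label, items in by_label.items():
    # Oversample: draw with replacement until each class reaches `target`.
    balanced.extend(items + random.choices(items, k=target - len(items)))

print(Counter(label for _, label in balanced))  # classes now have equal counts
```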


3. Text Augmentation

This means artificially expanding the dataset by creating variations of existing data points, using techniques such as:


Synonym Replacement - Replacing words with synonyms to introduce variety.

Paraphrasing - Generating slightly different phrasings of the same sentence to improve the model's ability to handle paraphrased language.

Back-translation - Translating text to another language and then back to the original language to introduce slight variations (see the sketch below).
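
Here is the promised sketch: WordNet-based synonym replacement via NLTK (assumes `nltk.download("wordnet")`), and back-translation using two publicly available Helsinki-NLP translation models through `transformers`.

```python
# A minimal sketch of two augmentation techniques: WordNet synonym
# replacement and back-translation via Helsinki-NLP translation models.
import random
from nltk.corpus import wordnet
from transformers import pipeline

def synonym_replace(sentence: str) -> str:
    words = sentence.split()
    i = random.randrange(len(words))
    synsets = wordnet.synsets(words[i])
    if synsets:
        # Swap one word for a random synonym from its first sense.
        words[i] = random.choice(synsets[0].lemma_names()).replace("_", " ")
    return " ".join(words)

# Back-translation: English -> French -> English.
to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    french = to_fr(sentence)[0]["translation_text"]
    return to_en(french)[0]["translation_text"]

print(synonym_replace("The movie was fantastic"))
print(back_translate("The movie was fantastic"))
```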

Tools and Techniques:

Libraries like NLTK (Natural Language Toolkit) or spaCy offer functionalities for text cleaning, tokenization, and other preprocessing tasks.

Cloud platforms like Google Cloud AI Platform or Amazon SageMaker provide managed services for data preprocessing and training pipelines.

For today, we have discussed the first two steps of building LLMs. Thanks, Mr. Bean, for joining me today. Let us discuss more in our next session, after 48 hours.



Bye Everyone, Stay Tuned.

Signing off,

Kiruthika Subramani.

