LLMs #2

Hey all, welcome back to the second episode of the Cup of Coffee Series with LLMs. Once again, we have Mr. Bean with us.

Are you here for the first time? Check out my previous articles, where I discussed the introduction to LLMs and the transformer architecture.

Woohoo, let's get started.

First, let us discuss the different categories of LLMs.

1. Generative

These models learn patterns in text data and use them to create something new, like writing original content or translating languages (make it new).

Tasks - Text generation, machine translation, writing different kinds of creative content.

Products - Content creation tools like Jasper or Writesonic, machine translation apps like Google Translate, AI-powered image/music generation tools like DALL-E or MuseNet.

2. Discriminative

These models focus on classifying existing text data, like labeling emails as spam or determining whether a review is positive (spot the difference). You can see this in action today in Amazon's AI-generated review summaries.

Tasks - Sentiment analysis (classifying text as positive, negative, or neutral), spam detection (classifying emails as spam or not spam).

Products - Spam filters in email service providers like Gmail, sentiment analysis tools in social media analytics platforms, chatbots with intent recognition.
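
To make the two categories concrete, here is a minimal sketch using the Hugging Face transformers library (my own illustrative choice, not tied to any product above): one pipeline generates new text, the other classifies existing text.

```python
# A minimal sketch contrasting the two categories with Hugging Face
# `transformers` pipelines. The model choices here are illustrative.
from transformers import pipeline

# Generative: continue a prompt with new text ("make it new").
generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=20)[0]["generated_text"])

# Discriminative: label existing text ("spot the difference").
classifier = pipeline("sentiment-analysis")
print(classifier("This coffee series is wonderful!"))
```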

Mr. Bean: What are the steps involved in building an LLM?

Sure, let me explain.

  1. Define Goal & Use Case
  2. Data Collection & Preprocessing
  3. Model Architecture & Design
  4. Train the LLM
  5. Evaluate & Validate
  6. Fine-Tuning (Optional)
  7. Deploy the LLM
  8. Ethical Considerations

These are the steps involved in building LLMs. Let me explain each in detail.

1. Define Goal & Use Case


Not all problems require an LLM. First, clearly define what you want the LLM to achieve.


Goal

What problem are you trying to solve with the LLM? Is it for generating creative text formats, translating languages, or writing different kinds of content?

Use Case

How will the LLM be used in practice? Will it be integrated into a specific application, used for research, or offered as a public service?


Mr. Bean: How do we know which problems require an LLM?


Great question.

Does the problem involve massive amounts of text data? LLMs excel at learning patterns from large datasets.
Does the task require understanding complex relationships within language?
Is the goal to achieve human-level performance on a language task?

If the answer to these questions is yes, an LLM would be a great choice.

For example:

An LLM could be a good choice for summarizing news articles - it can process massive amounts of text and identify key points. (Suitable)

An LLM might not be the best choice for a simple math equation solver - a traditional algorithm might be faster and more efficient. (Not ideal)

Mr. Bean: Can I know where LLMs are not ideal?


LLMs are excellent for creative tasks and working with language, but it's important to be aware of their limitations. For tasks requiring strict reasoning and logic, factual accuracy, real-time responses, common sense and social cues, up-to-date information, or exact data retrieval, other tools might be more suitable.

LLMs can be fooled by biased data or misleading information in their training data. This can lead them to confidently spout incorrect information.

2. Data Acquisition and Preprocessing for LLMs

This involves two processes: data collection and preprocessing. Let us discuss both in detail.

 I. Data Collection         

I. a) General LLMs

Web Crawling

This is a large-scale automated process that harvests data from publicly accessible websites.

Techniques involve identifying relevant URLs, navigating website structures, and extracting text content. Tools like Apache Nutch or Scrapy can be used.
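
For instance, here is a minimal Scrapy spider sketch; the start URL and CSS selectors are hypothetical placeholders that would need to match the real site's structure (and respect its robots.txt rules).

```python
# A minimal sketch of a Scrapy spider for harvesting article text.
# The start URL and CSS selectors are hypothetical placeholders;
# adapt them to a site you are actually permitted to crawl.
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/articles"]  # hypothetical listing page

    def parse(self, response):
        # Extract the visible paragraph text of each article on the page.
        for article in response.css("article"):
            yield {"text": " ".join(article.css("p::text").getall())}
        # Follow the pagination link, if any, to continue the crawl.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as article_spider.py, it can be run with `scrapy runspider article_spider.py -o corpus.jl` to collect the records into a JSON Lines file.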

Public Text Datasets

Pre-existing collections of text data like books (Project Gutenberg), articles (arXiv), or code repositories (GitHub) can be valuable sources.
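
Many of these are one import away. As a minimal sketch, the Hugging Face `datasets` library (my own choice here, not mentioned above) can pull a public corpus such as WikiText-2:

```python
# A minimal sketch of loading a public text corpus with the Hugging Face
# `datasets` library; WikiText-2 is one small, freely available example.
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(dataset[0]["text"][:200])  # peek at the first record
```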

I. b) Specialized LLMs

Domain-specific data is crucial. For example, a medical LLM might utilize scientific publications from PubMed Central, while a legal LLM could leverage legal case documents.

I. c) Other methods for data acquisition include

Data Sources Beyond Crawling:

Data APIs and Marketplaces

These platforms act as libraries, offering pre-processed datasets on various topics. Imagine them as pre-organized data, ready for your LLM training. Examples include datasets hosted on Google Cloud Storage or Amazon S3.
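
As a minimal sketch, fetching a hosted dataset object from Amazon S3 with boto3 might look like this; the bucket and key names are hypothetical, and real datasets publish their own locations and access terms.

```python
# A minimal sketch of downloading a hosted dataset object from Amazon S3
# with boto3. The bucket and key below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-public-datasets",   # hypothetical bucket name
    Key="corpora/news-articles.jsonl",  # hypothetical object key
    Filename="news-articles.jsonl",     # local destination
)
```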

Data Licensing

Sometimes, the most valuable data resides with private organizations (e.g., news agencies, medical journals). Obtaining licenses grants access to this data, but careful consideration of copyright and usage terms is crucial.

Enhancing Data Diversity

Crowdsourcing

Platforms like Amazon Mechanical Turk allow you to outsource data collection tasks (labeling, annotation) to a large, distributed workforce. This can be a cost-effective way to gather diverse data points, especially for specific needs.

Manual Curation

For specialized LLMs, human expertise is irreplaceable. Experts can select high-quality data that's directly relevant to the LLM's purpose, ensuring the model focuses on the most appropriate information.

II. Data Preprocessing

1. Text Cleaning
2. Data Balancing
3. Text Augmentation

1. Text Cleaning

Normalization

Lowercasing: Converting all text to lowercase for consistency (e.g., "Cat" and "CAT" become the same).

Punctuation Removal: Removing punctuation marks (.,?!) as they might not hold meaning for the LLM. Decisions might be made to keep some punctuation (e.g., quotation marks for dialogue).


Tokenization: Breaking down text into smaller units the LLM can understand. This could be words, characters, or even sentences depending on the model's architecture.

Stop Word Removal: Eliminating common words that offer little meaning (e.g., "the," "a," "an") to improve training efficiency and focus on content-rich words.

Spelling Correction: Techniques like dictionary lookup or statistical methods can be used to identify and correct typos or misspellings.

Entity Recognition and Removal: Identifying and removing named entities like people, locations, or organizations, which may be necessary for privacy or security reasons. This step is optional. A minimal sketch combining several of these cleaning steps follows below.
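
Here is that sketch, using NLTK and spaCy; it assumes their resources have already been downloaded (`nltk.download("punkt")`, `nltk.download("stopwords")`, `python -m spacy download en_core_web_sm`).

```python
# A minimal text-cleaning sketch with NLTK and spaCy (resources assumed
# to be downloaded already, as noted above).
import string
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nlp = spacy.load("en_core_web_sm")
STOP_WORDS = set(stopwords.words("english"))

def clean(text: str) -> list[str]:
    # Optional entity removal: mask people, places, organizations.
    doc = nlp(text)
    for ent in reversed(doc.ents):  # reversed so character offsets stay valid
        text = text[:ent.start_char] + "ENTITY" + text[ent.end_char:]
    # Normalization: lowercase, then strip punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenization: break the text into word-level units.
    tokens = word_tokenize(text)
    # Stop word removal: keep only content-rich words.
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("Dr. Smith visited Paris and THE weather was lovely!"))
```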

2. Data Balancing

Real-world data often has inherent biases. Techniques like oversampling (replicating underrepresented data points) or undersampling (removing data from overrepresented classes) can be used to create a more balanced dataset for training.
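
A minimal sketch of naive oversampling on a toy labeled dataset (dedicated libraries such as imbalanced-learn exist, but the core idea is just replication):

```python
# A minimal sketch of naive oversampling on a toy labeled dataset.
import random
from collections import Counter, defaultdict

data = [("great product", "pos"), ("loved it", "pos"),
        ("awful experience", "neg")]  # "neg" is under-represented

by_label = defaultdict(list)
for text, label in data:
    by_label[label].append((text, label))

target = max(len(items) for items in by_label.values())
balanced = []
for label, items in by_label.items():
    # Oversample: draw with replacement until each class reaches `target`.
    balanced.extend(items + random.choices(items, k=target - len(items)))

print(Counter(label for _, label in balanced))  # classes now have equal counts
```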


3. Text Augmentation

This means artificially expanding the dataset by creating variations of existing data points, using techniques such as:


Synonym Replacement - Replacing words with synonyms to introduce variety.

Paraphrasing - Generating slightly different phrasings of the same sentence to improve the model's ability to handle paraphrased language.

Back-translation - Translating text to another language and then back to the original language to introduce slight variations (see the sketch below).
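
Here is the promised sketch: WordNet-based synonym replacement via NLTK (assumes `nltk.download("wordnet")`), and back-translation using two publicly available Helsinki-NLP translation models through `transformers`.

```python
# A minimal sketch of two augmentation techniques: WordNet synonym
# replacement and back-translation via Helsinki-NLP translation models.
import random
from nltk.corpus import wordnet
from transformers import pipeline

def synonym_replace(sentence: str) -> str:
    words = sentence.split()
    i = random.randrange(len(words))
    synsets = wordnet.synsets(words[i])
    if synsets:
        # Swap one word for a random synonym from its first sense.
        words[i] = random.choice(synsets[0].lemma_names()).replace("_", " ")
    return " ".join(words)

# Back-translation: English -> French -> English.
to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    french = to_fr(sentence)[0]["translation_text"]
    return to_en(french)[0]["translation_text"]

print(synonym_replace("The movie was fantastic"))
print(back_translate("The movie was fantastic"))
```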

Tools and Techniques:

Libraries like NLTK (Natural Language Toolkit) or spaCy offer functionalities for text cleaning, tokenization, and other preprocessing tasks.

Cloud platforms like Google Cloud AI Platform or Amazon SageMaker provide managed services for data preprocessing and training pipelines.

For today, we have discussed the first two steps of building LLMs. Thanks, Mr. Bean, for joining me today. Let us discuss more in our next session, after 48 hours.



Bye Everyone, Stay Tuned.

Signing off,

Kiruthika Subramani.

