A Beginner’s Guide to Large Language Models

In our ongoing series of blogs “Unravelling the AI mystery” Digitate continues to explore advances in AI and our experiences in turning AI and GenAI theory into practice. The blogs are intended to enlighten you as well as provide perspective into how Digitate solutions are built.

Please enjoy the blogs

1. Riding The GenAI Wave

2. Prompt Engineering – Enabling Large Language Models to Communicate With Humans

3. What are Large Language Models? Use Cases & Applications

4. Harnessing the power of word embeddings

written by different members of our top-notch team of data scientists and Digitate solution providers.

A Beginner’s Guide to Large Language Models

Natural Language Processing (NLP) influences our world in many ways. Our daily lives are permeated by applications of NLP, such as search engines, question answering, document analysis, spam filtering, customer service bots, etc. It is fascinating to study how the underlying engines running these applications work, especially how machines work with language and text.

It is numbers that a computer understands and manipulates. To represent and process language or textual information, we convert it into numbers called embeddings, which we discussed in one of our previous blog posts. Over the years, the techniques of language understanding/representation have evolved in the world of NLP. Following are some of the key areas in which NLP has evolved over time:

Statistical Measures: A baby step in this space consists of applying statistical techniques on strings. No meaning is attached to the text yet. Techniques such as set similarity and edit distance belong to this class.

Word Embeddings: This is a simple way to understand some meaning of the text. Words are represented as numbers such that these numbers capture the context in which the words are used. Techniques such as word2vec fall in this space.

Sentence Embeddings: Word embeddings are then combined to form sentence-level embeddings to define the meaning of longer sentences and paragraphs. It captures the context across sentences; however, there is no detailed contextual understanding of the language itself. Techniques such as Bag of Words fall in this space.

Language Models: These are complex models designed to understand and generate human language. They learn from a raw text corpus.

Large Language Models (LLM): If the size of the language model and training corpus is in the order of hundreds of millions, it is called a large language model. Any type of language model can be scaled up; however, most recently, the focus has been on scaling up generative models. BERT and GPT (Generative pre-trained transformer) are examples of large language models.

In this post, we will explore large language models (LLMs) with a specific focus on generative analytics. As there are many large language models, we will deep-dive with examples of the popular GPT series. This blog discusses GPT models in general. We will publish another blog with a specific focus on ChatGPT, a separate offshoot of GPT models that is specifically trained for conversations.

We discuss:

About Generative Models
What are Large Language Models (LLM)?
How do large language models work?
Why are large language models important?
What are large language models used for?
Generative Pre-training of GPT
Supervised Fine-tuning of GPT
Output Illustration: What would the model predict?
Opportunities and Challenges of LLMs
What is the difference between large language models and generative AI?
Real-world Applications of LLMs
What is the future of LLMs?

About Generative Models

Generative models can generate any type of data. These models can generate text, images, video, and audio. They have been used for a long time; however, only recently have they started gaining traction in the Natural Language Processing world. For example, we get a recommendation while typing text on our phone or computer, right? This is an example of a text generation model working in the background!

For a text-based model, statistically speaking, they learn a probability distribution of how words occur in any language and then use these probabilities to generate text output. So, what do we want from a model that is “generative” in nature? It would be great to have the ability to generate “a relevant sequence of text” when given a starting point or input. For example, a simple logic, such as a model that predicts the next word (to a given input or prompt) and then, the next word and then the next word, and so on! That is how we can generate sentence after sentence! This is what GPT does.

In the term GPT,

G stands for Generative: The model generates text
P stands for pre-trained: The model is pre-trained on a very large text corpus
T stands for Transformer: The model transforms text and is based on the transformer architecture

What are Large Language Models (LLM)?

As we saw earlier, a language model is a deep learning-based model that understands the context and meaning of words and sentences in a language. LLMs are large-scale language models that are pre-trained on a large volume of training data and have a large number of model parameters.

Let us also understand what the parameters of a large-scale language model are. A deep learning model has many layers of neurons connected to each other. These help it learn the relationships between the input and output when the model is trained on data. To put it in a simplified manner, these relationships are stored by the model in the form of numbers called weights. These are also called parameters. The more the neurons and connections, the more the parameters! The more the parameters, the more nuances of the relationship between words/phrases can be learned. An LLM has hundreds of millions of such parameters. While BERT has around 345 million parameters, GPT2 has 1.5 billion parameters, and GPT3 has 175 billion parameters. GPT4 is supposed to have parameters in trillion(s).

Another aspect that influences an LLM is the size of the training corpus. The larger the size of the raw training text, the more detailed the learning. The more varied the text is, the more different things the model learns! GPT3 is pre-trained on multiple sources of data. It is trained on hundreds of billions of words, including code.

Following is its composition:

Common Crawl consists of ~60% of training data. It refers to public datasets that are obtained by crawling the web. It includes multilingual text.
WebText2 consists of ~22% of training data. It is a private OpenAI dataset created by crawling links from Reddit that had three upvotes. The idea is that these URLs are trustworthy and will contain quality content.
Books1 consists of ~8 % of training data. Books is an internet-based collection of a sampling of books.
Books2 consists of ~ 8% of training data.
Wikipedia consists of ~ 3% of training data.
Note that there are many LLMs currently available. Notable among them are:
Gemini: Developed by Google, it is the latest, most powerful, and general model from Google. It comes in three versions: Ultra, Pro, and Nano.
LLaMA: It is a family of autoregressive large language models (LLMs) released by Meta AI in February 2023. They range in scale from 7B to 70B parameters.
Galactica: Developed by Meta. A large language model that can store, combine, and reason about scientific knowledge.
Codex: It is the model that powers GitHub Copilot. Proficient in more than a dozen programming languages, Codex can now interpret simple commands in natural language and execute them.
PaLM-E: Developed by Google. It is an LLM focused on robot sensor data.
Chinchilla: Developed by Deepmind. It considerably simplifies downstream utilization because it requires much less computational power for inference and fine-tuning.

How do large language models work?

Large language models work through a two-step process that involves extensive pre-training and fine-tuning. Specialized data sets are used to make LLMs adaptable to various specific tasks. Let’s break down this complex process and examine each step in detail:

Pre-training

As an advanced AI model, a large language model is pre-trained on a large volume of data, also known as a corpus, obtained from different sources like public forums, tutorials, Wikipedia, Github, etc. In this stage, unlabeled, unstructured data sets are used by LLMs for what is known as “unsupervised learning.” These data sets consist of trillions of words that LLMs will understand, analyze, and find connections between. These unstructured data sets are invaluable as they contain more data. When these data sets are fed to LLMs without instructions, their AI algorithm helps them understand semantics and establish relationships between words and concepts, shaping their understanding of the human language.

Fine-Tuning

Once the unstructured data from different sources has been fed to LLMs, the next step involves fine-tuning the information learned. In this stage, some labeled data is available that LLMs utilize to accurately distinguish between and learn different concepts. Simply put, fine-tuning is how LLMs enhance their understanding of words and concepts to optimize their performance on specific NLP-related tasks.

Once a base-trained model has been achieved, it can be developed further with specialized instructions for various practical purposes. Therefore, LLMs can be queried with relevant prompts, and they will then use AI model inference to respond accordingly (such as with an answer to a question, translated text, sentiment analysis reports, etc.)

Why are large language models important?

In this day and age, when digital transformation is at its peak, LLMs have attained particular importance in artificial intelligence as they are the foundational models used to build wide-ranging applications for various uses. Other than teaching AI and machine learning (ML) models the human language, LLMs are capable of performing complex functions such as translation, content summarization, classification and categorization, sentiment analysis, etc., with ease, making them invaluable for industries like healthcare, finance, marketing, and entertainment.

What makes LLMs significant today is their large sets of parameters (which are like human memories) that help them learn through fine-tuning. Because LLMs are based on a transformer model architecture consisting of an encoder and a decoder, they can process data by “tokenizing” the input and simultaneously analyzing the data to derive relationships between tokens (words and concepts). LLMs are versatile AI models that can be deployed across different use cases and applications. And as they exhibit efficiency and accuracy, they’re widely used by businesses in diverse fields.

What are large language models used for?

LLMs are highly versatile and can be used for completing different NLP-related tasks. These tasks include translation, whereby LLMs trained in multiple languages translate content from one language to another; content summary, whereby LLMs summarize extensive text blocks or multiple pages of text for a simplified understanding; and content rewriting, whereby LLMs will rewrite sections of text upon instruction, offering an easy and efficient way to modify content. LLMs are also helpful for:

Information retrieval

LLMs aid information retrieval systems to enhance the relevance and accuracy of search results. For instance, each time a large language model (e.g., Google or Bing search engine) is queried, it searches for relevant information and summarizes it before it is communicated to you in a conversational style. This is how an LLM will perform informational retrieval in response to a prompt given as a question or an instruction.

Sentiment analysis

These AI models can also understand the sentiment/emotion behind words and analyze the intent behind a piece of content or a particular response, just like humans do. This makes LLMs a viable alternative to hiring human agents in the customer service departments of large organizations.

Text generation

Large language models like ChatGPT are a brilliant example of generative AI enabling users to generate new text based on inputs. When prompted, these models produce textual content pieces. For instance, they can be prompted with a command like “Compose a short coming-of-age story in the style of Louisa May Alcott” or “Write a tagline for the newly opened supermarket.”

Code generation

LLMs exhibit a remarkable ability to understand patterns, enabling them to generate functional code. As they can understand the coding requirements, they are an invaluable asset in programming, helping developers create code snippets for several different software applications.

Chatbots and conversational AI

Chatbots, also known as virtual assistants, are conversational AI models that can understand the context and sentiment behind conversations and create natural and engaging responses to interact with users, just like a human agent does. A common application of conversational AI is through chatbots, which can take various forms and engage users in a query-and-response model. OpenAI‘s ChatGPT is the most widely used LLM-based AI chatbot developed using the GPT-3.5 model. Now, users also have the option to leverage the newer GPT-4 LLM for enhanced performance.

Classification and categorization

In traditional ML use cases involving classification as well, LLMs can be used. The in-context learning ability of LLMs leveraging Zero-Shot and Few-Shot learning can be used.

Generative Pre-training of GPT

As mentioned earlier, we need to pre-train the model on a very large text corpus. It then understands the underlying relationship between various words in a given context. Let’s look at an example.

Suppose we input a sentence to the model, “Life is a work in progress.” What does it learn? And how? Firstly, we should understand that we push any model to learn by providing inputs and outputs (targets). It then learns to map them, i.e., for a given input, it learns to predict the output as close to the target that is present in the training data. For generative purposes, a brilliant way of creating a target is to shift our input sentence one word to the right and make it a target! This way, we teach the model to generate a “next word,” given the previous sequence of words.

This means the model is learning the relationships such that it learns:
For input, “Life is a,” the target word to generate is “work.”
For input, “Life is a work in,” the target word to generate is “progress”
And so on…
And lastly comes the token “<eos>.” This is the end of the sentence token; it now learns to stop the sentence!
There are some interesting aspects of this training process:
Self-Supervised: It is interesting to note that the training is unsupervised or self-supervised, meaning that nobody provided the model with any labeled input. The raw sentences are fed to the model; the target sequence is shifted by one token, and voila! The model starts learning relationships between the words, which it can reproduce later while predicting the output sequence.
Autoregressive: Also, the context of the words is carried from not just the last word but all the previous words. After each token or word is generated, that token is added to the sequence of inputs. The whole sequence then becomes an input to the subsequent step!
Unidirectional: The context is carried and learned from left to right and not vice versa, unlike BERT, which is bidirectional.

Supervised Fine-tuning of GPT

The generative pre-training helps GPT understand the nuances of the language. After we have pre-trained the GPT model, we can now fine-tune it for any task in NLP. We can use a domain-specific dataset in the same language to take advantage of the learnings and understanding of the model for that language.

The purpose of fine-tuning is to optimize the GPT model further to perform well on a specific task by adjusting its parameters to better fit the data for that task. For example, a GPT model that has been pre-trained on a large corpus of text data can be fine-tuned on a dataset of specific sports commentary to improve its ability to answer the questions for a given sports event. Please note that it can get expensive to fine-tune such large models depending on the data size. Other smaller or specific versions of GPT3 (such as Ada and Davinci) are also available for fine-tuning.

Fine-tuning a GPT model is a powerful tool for a variety of NLP applications, as it enables the model to be tailored to specific tasks and datasets.

Output Illustration: What would the model predict?

Let us understand these concepts with an example. Consider that we use the following sentences to train a GPT model:

Life is a dream for the wise.
Life is a dream — realize it.
Life is a song — sing it.
Life is a song — play it.
Life is a challenge — meet it.
Life is a work of art.
Life is the work of God’s agent.
Life is a sacrifice — offer it

After pre-training on the above sentences, let us input the sequence- “Life is a” to the model. What will it generate or predict next?

A simplified explanation is that the model builds a probability distribution of the possible words that can come in next! Will it be “work” or “song”? If the word “work” has a higher probability (i.e., appearing the greatest number of times), the word “work” is predicted. The model then tries to predict the next word after “Life is a work.” And this continues till we reach the “End of Sentence” prediction!

Opportunities and Challenges of LLMs

LLMs present a wide range of advantages:

Ready to use: They can be used directly for many tasks and to classify items into various categories without explicit training or labeling.
Easy to use: They are just an API/chat away. It requires very little coding to access them. This makes their use very accessible to a wide range of users.
Useful for a wide variety of tasks: They can take inputs in many forms and produce output as a sequence of text, making them useful in a variety of scenarios.
LLMs are also being released as open source: Many LLMs are being released as open source, making them accessible to researchers and industry practitioners.
While all these advantages make LLMs very enticing, they also come with certain limitations:
LLMs are large: It is very difficult to create an LLM from scratch.
LLMs are Black boxes: It is very difficult to explain exactly (almost impossible) why a certain sequence was generated by the model. We cannot be sure about the underlying logic.
LLMs can have a bias: If the corpus for pretraining has a bias in the text, the same bias is percolated in the model’s output.
LLMs can do hallucinations: The output sequence may not be fully factual always. The model seems to hallucinate.
LLMs may have IP issues: Who owns the underlying data and its subsequent downstream tasks IP is an unanswered question.

What is the difference between large language models and generative AI?

Generative AI is a broad term in the field of artificial intelligence that refers to models equipped with the capability to generate a wide range of content, such as text, code, images, video, and music. Gen AI models can create and not just analyze a given input. Some key examples of generative AI include Midjourney, DALL-E, Bard, and ChatGPT.

On the other hand, large language models are a category of generative AI that uses textual data to produce textual content as output. These models get specialized training on textual data so that they can generate textual content with varying levels of accuracy. These models are way more advanced than traditional rule-based machine learning models as they can imagine and create new pieces of text based on the training data sets. A well-known example of this is ChatGPT.

Not all generative AI models are large language models. Multimodal LLMs are an example wherein the input can be images, but the output can be text. LLMs can understand the nuances in the human language by identifying and analyzing patterns in the training data and thus can produce coherent and contextually relevant responses that may be sentences, paragraphs, or even entire articles. So, whether you are looking to generate complex text or summarize a given piece of information, it is the generative and transformative potential of LLMs that empowers you to do so in under a few minutes.

Real-world Applications of LLMs

Hundreds of new apps on LLMs are being built every day around the globe, impacting all industries. A few notable applications include:

Code generation
Content generation tools
Copywriting
Conversational tools
Educational tools
Enterprise search
Information retrieval
Virtual Assistant

Domain-specific customizations are also being developed. An interesting example in this space is the finance GPT that has been built by Bloomberg called the BloombergGPT, which specializes in finance and is designed to assist analysts, advisors, and other professionals.

Bing search has integrated GPT4 and has taken the search engine to the next level as it minimizes hallucinations by combining factual search and generative tech.

What is the future of LLMs?

If we think about the evolution of the early AI and ML models into the advanced generative AI models like large language models today, the transformation has been nothing short of impressive. Even in the coming years, these large language models are set to advance to become more sophisticated and serve humans in different areas of life. It is believed that the future will see LLMs training on heavier and more extensive data sets to deliver an almost human-like performance regardless of what they’re used for. Although LLMs will never match human intelligence, they will undoubtedly keep getting smarter as generative AI improves with time. Here are some projected future LLM trends:

Increased capabilities

Despite the impressive advancements made so far, LLMs are not yet perfect. There is still a huge scope for their improvement. Hence, future versions of these AI models are expected to have greater accuracy and enhanced capabilities. Developers strive for perfection by continually refining LLMs, working hard to eliminate inaccuracies, and reducing bias in the current models. This dedication to improvement lays the foundation for more advanced iterations, ensuring that future LLMs will meet the demands of constantly evolving AI applications.

Audiovisual training

While large language models have traditionally been trained on textual data, a notable shift is taking place, with some models now being trained using audio-visual input. This sparks a significant innovation in AI model development, speeding up the learning process and unlocking new possibilities. Using visual and auditory data enhances the adaptability of LLMs, paving the path for far more versatile functionalities. This innovation particularly holds promise for applications made for autonomous vehicles, where a multimodal approach to training can significantly enhance the understanding and responsiveness of the models.

Workplace transformation

LLMs will be instrumental in transforming how professionals in the workplace operate. Just as robots were used for streamlining manufacturing processes, LLMs are also believed to usher in a significant change in the coming years, taking the burden of monotonous tasks off humans. Some of the tasks that LLMs will be capable of handling efficiently in the future include repetitive clerical duties, customer service chats, and simple automated copywriting.

Conversational AI

LLMs are also set to enhance the ability of virtual assistants like Siri, Alexa, and Google Assistant to comprehend user intent and respond more accurately to complex commands. The improvement in their interpretation of the given commands will significantly enhance their overall performance and even make extended and seamless interactions with users possible.

Attribution and explanation

A current challenge in the use of LLMs is that the source of the generated content isn’t always known. Future LLMs are expected to provide clearer explanations and attributions for their results. By offering transparent explanations, users will be better able to understand where the content came from, establishing greater trust and reliability in using LLMs for different applications across industries like healthcare, finance, security, etc.

Conclusion

The space of natural language processing has seen significant advancements over the years. It offers various levers ranging from statistical tools to word embeddings to language models. The appropriate tool is selected for a given task by considering various factors such as the type of task, available computational power, amount of data, and type of data available. For example, if we have a good amount of data in the biomedical field and want to build classification or extraction tasks on it, then this can be done using BERT! On the other hand, if we have a limited amount of data from IT server logs and want to extract enterprise context from these logs, then word and sentence embeddings might suit this task better.

Moreover, machines nowadays can understand and express themselves in a human-like manner. State-of-the-art performances on various NLP tasks (such as classification, Q&A, etc.) are achieved and broken every few months. With the current pace of research across the ecosystem, we can expect newer models and techniques to be released very often. Business applications of these models are also evolving, and the way the world works is expected to change in many industries. As accuracy increases, so does the productive use of these techniques.

Lastly, the latest in this technology is expected to contribute to the building of “Artificial General Intelligence.” When and how? We’re waiting with bated breath!

Written by Sarang Varhadpande, Machine Learning Solution Architect at Digitate

A Beginner’s Guide to Large Language Models

About Generative Models

What are Large Language Models (LLM)?

How do large language models work?

Pre-training

Fine-Tuning

Why are large language models important?

What are large language models used for?

Information retrieval

Sentiment analysis

Text generation

Recommended by LinkedIn

Code generation

Chatbots and conversational AI

Classification and categorization

Generative Pre-training of GPT

Supervised Fine-tuning of GPT

Output Illustration: What would the model predict?

Opportunities and Challenges of LLMs

What is the difference between large language models and generative AI?

Real-world Applications of LLMs

What is the future of LLMs?

Increased capabilities

Audiovisual training

Workplace transformation

Conversational AI

Attribution and explanation

Conclusion

The Closed Loop

3,593 followers

More articles by Digitate

Transforming First Mile and Last Mile of AIOps with Generative AI

Embark on your AI adoption journey with help from a master orchestrator

How procurement teams can make smarter decisions using market intelligence

When Digitate Looks In The Mirror, Does It See A Digital Twin?

AI resistance isn’t where you expect it

Streamlining SAP Management: A Deep Dive into Digitate ASUG Survey

Tribal knowledge, the secret stumbling block to AIOps transformation

AI-driven spend classification: Turbo-charging spend analytics

Facing skepticism about the ROI of intelligent automation? Here’s expert advice on escaping the IA doom loop

Escape the endless IT upgrade cycle with AI-guided change management

Insights from the community

Others also viewed

Mathematical Foundations of Large Language Models

How to Become a Master in Large Language Models (LLMs)

LLM Models

Unraveling the Magic of Transformers in NLP

Get Ahead of the Competition with Generative AI: The Technology That's Changing Everything

Phi-2: A Small Language Model That Packs a Big Punch

Mastering ROUGE Matrix: Your Guide to Large Language Model Evaluation for Summarization with Examples

Part 9: The Next Leap in AI — From Transformers to Pre-Trained Powerhouses

Enhancing Named Entity Recognition (NER) with Large Language Models (LLMs)

Natural Language Processing: Transforming AI & Daily Life in America

Explore topics