ChatGPT is all over the news and features heavily in company newsletters touting new business capabilities. Its capabilities seem astounding: writing emails that sound like you, drawing on more text than a human could read in a lifetime, drafting reports, and answering complex queries. This is the era of the large language model (LLM).
We’re now at the cutting-edge intersection of business, technology, and linguistics. For companies seeking a competitive edge, understanding these behemoths of the AI world isn’t just beneficial—it’s essential. Dive in as we unravel the magic behind these digital wordsmiths and the new paradigms they’re setting for future business operations.
What are large language models?
LLMs are machine learning models designed to understand and generate human language. While “understand” should probably go in quotation marks, these models do understand one thing: patterns. Thanks to a massive number of parameters, ranging from hundreds of millions to hundreds of billions, these models can capture and process intricate patterns and nuances of language like never before.
If you’ve ever used ChatGPT or a similar LLM, you might be fooled into thinking the machine is sentient. That isn’t true yet. But these LLMs are doing something truly remarkable by understanding and predicting patterns in language that read as close to human comprehension and generation as possible without sentience. It’s a new world.
Several neural network architectures show up in and around large language models:
- Transformer Models: Introduced in the paper “Attention Is All You Need” by Vaswani et al., these use self-attention mechanisms to weigh input tokens differently, allowing for dynamic relationships between different parts of an input sequence. They dominate NLP tasks and are the basis for models like BERT, GPT, T5, etc.
- Autoencoder: A neural network used for unsupervised learning of efficient codings, it’s designed to minimize the difference between input and its reconstruction. It has applications in anomaly detection, de-noising data, and generating new data.
- Sequence-to-Sequence (Seq2Seq): A model consisting of two primary parts: an encoder and a decoder. The encoder processes an input sequence and compresses it into a context vector. The decoder takes this context vector and produces an output sequence. This has applications in machine translation (e.g., translating English to French), speech recognition, and text summarization.
- Recursive Neural Networks (RecNNs or TreeNets): RecNNs operate on hierarchical tree structures rather than on flat sequences. Leaf nodes in these trees represent words, and each parent node represents the phrase formed by combining its children. They’re often used in tasks like parsing and sentiment analysis.
- Hierarchical Models: A type of model architecture designed to capture hierarchical structures in data. It can involve multiple levels or layers, each capturing different levels of abstraction. These appear in image recognition tasks (where different layers can recognize parts of objects, for example) and document classification (where different levels might understand words, sentences, paragraphs, and entire documents).
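To make the tree idea behind recursive neural networks concrete, here is a minimal sketch in plain Python. The word vectors and the composition weights are made-up numbers chosen for illustration; a trained model would learn them from data. It shows only the bottom-up composition step, not training.

```python
import math

# Hypothetical 2-dimensional word vectors, chosen by hand for illustration.
EMBEDDINGS = {
    "not": [-1.0, 0.5],
    "very": [0.2, 0.9],
    "good": [0.8, 0.1],
}

# Hypothetical composition weights (2 outputs x 4 inputs); a real model learns these.
W = [
    [0.5, -0.3, 0.8, 0.1],
    [-0.2, 0.7, 0.1, 0.4],
]


def compose(left, right):
    """Combine two child vectors into one parent vector: tanh(W @ [left; right])."""
    joined = left + right  # list concatenation, length 4
    return [math.tanh(sum(w * x for w, x in zip(row, joined))) for row in W]


def encode(tree):
    """Encode a binary tree bottom-up. Leaves are words; internal nodes are (left, right) pairs."""
    if isinstance(tree, str):
        return EMBEDDINGS[tree]
    left, right = tree
    return compose(encode(left), encode(right))


# "not (very good)": the phrase "very good" is encoded first, then combined with "not".
phrase_vector = encode(("not", ("very", "good")))
print(phrase_vector)
```

Because the same `compose` function is applied at every node, the grouping of the words changes the result: `(("not", "very"), "good")` produces a different vector from `("not", ("very", "good"))`.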
What’s the difference between LLMs and natural language processing (NLP)?
NLP and LLMs are related concepts in the field of artificial intelligence.
Natural language processing is a broad field of application focusing on the interaction between computers and humans through natural language. The primary goal is to enable computers to understand, interpret, and generate human language meaningfully and usefully. LLMs are specific types of machine learning models designed to understand and generate human language. They’re a subset of models and techniques used in NLP.
NLP encompasses a wide range of tasks and also covers foundational topics like linguistics, semantics, and syntax. LLMs primarily focus on understanding context from vast amounts of text and generating coherent and contextually relevant content. Where NLP can be applied to simple or more complex tasks, LLMs are typically utilized for more complex understanding and generation on par with well-informed humans.
NLP also has a long history, tracing back to the early days of computer science. Early NLP relied on hand-crafted rules; later systems moved to statistical and then neural methods. LLMs are a more recent evolution of NLP, using deep learning models to mimic human communication and understanding.
How do large language models mimic humans?
The ability of an LLM to mimic human text, speech, and understanding comes from extensive training. It’s important to note that while it may seem like LLMs think like humans, they don’t “understand” language or concepts in the same way humans do. Their “knowledge” is pattern recognition derived from vast amounts of data, devoid of true consciousness, emotions, or innate understanding.
That said, these models mimic human language understanding and generation through a combination of vast amounts of data, intricate architecture, and advanced training methods. Here’s how they approach human-like linguistic capabilities:
LLMs are trained on enormous datasets, much of which comes from the internet. This includes books, articles, websites, social media, and other forms of written input. These inputs expose the model to a diverse range of topics, contexts, and writing styles. By processing this data, they learn grammar, idioms, facts, reasoning patterns, and even some biases present in the texts.
Models are “pre-trained” on this material to learn fundamental language tasks. Once models are successfully pre-trained, they can be adapted to task-specific use cases:
- Fine-tuning:
  - Concept: Once an LLM has been pre-trained on a large corpus, it can be further trained (fine-tuned) on a smaller, task-specific dataset.
  - Use: It’s a standard approach for adapting a general-purpose model to a specific task, like sentiment analysis or named entity recognition.
- In-context Learning:
  - Concept: Instead of fine-tuning, the model uses the context provided in the prompt to guide its responses. Essentially, you give the model a bit of guidance through the input to achieve the desired output.
  - Use: Useful when you want to guide the model’s behavior without fine-tuning it on new data, e.g., asking GPT-3 to “Translate the following English text to French: …”
- Zero-/One-/Few-shot Learning:
  - Concept: This refers to the model’s ability to perform tasks without (zero-shot), with one (one-shot), or with a few (few-shot) examples to guide it.
  - Zero-shot: You ask the model to perform a task without providing any examples. E.g., “Translate the following text into French: …”
  - One-shot: You provide one example to guide the model.
  - Few-shot: You give multiple examples to help the model generalize the task. E.g., providing several translation pairs before asking for a new translation.
  - Use: Allows the model to tackle tasks it wasn’t explicitly fine-tuned on.
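The difference between zero-, one-, and few-shot prompting comes down to how many worked examples go into the prompt. The sketch below builds the three variants as plain strings. The `build_prompt` helper, the “English:/French:” layout, and the example sentences are assumptions made for illustration, not a format any particular model requires.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: the instruction, zero or more worked examples, then the new input."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")  # left open for the model to complete
    return "\n".join(lines)


instruction = "Translate the following English text to French."
pairs = [("Good morning.", "Bonjour."), ("Thank you.", "Merci.")]

zero_shot = build_prompt(instruction, [], "Where is the station?")
one_shot = build_prompt(instruction, pairs[:1], "Where is the station?")
few_shot = build_prompt(instruction, pairs, "Where is the station?")

print(few_shot)
```

The model’s weights are the same in all three cases; only the text it is asked to continue changes.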
Deep learning architecture
One of the more common architectures, the Transformer, excels at handling sequential data like text. Instead of taking each word alone, it employs attention mechanisms that allow the model to focus on different parts of an input text. This is a lot like how humans pay attention to specific words or phrases when comprehending language.
The role of context
The ability to understand context is a feature of specific deep learning architectures. The Transformer architecture, for example, utilizes self-attention mechanisms that weigh input tokens (the chunks or units of text that the model processes) differently. This allows the model to focus on different parts of the input for various tasks. This mechanism is central to models like BERT, GPT, and their derivatives, enabling them to achieve state-of-the-art performance on many NLP tasks.
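Here is a minimal sketch of how self-attention weighs tokens, written in plain Python. It is a simplification: a real Transformer first multiplies the inputs by learned query, key, and value matrices and runs many attention heads in parallel, while this version uses the token vectors directly. The tokens and their 2-dimensional vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = the inputs."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # How strongly this token attends to every token in the input, itself included.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # The output is a weighted blend of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional token vectors for a three-token input.
tokens = ["river", "bank", "loan"]
vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how one token distributes its attention across the whole input. Tokens with similar vectors receive more weight, which is the “weigh input tokens differently” step in miniature.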
In contrast, earlier NLP approaches, like static word embeddings, offered a fixed representation for each word, irrespective of its context. Modern Transformer-based models instead provide dynamic word representations based on context, capturing nuances like polysemy, where a word can have multiple meanings based on its usage.
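The contrast between fixed and context-dependent representations can be shown with a toy example. The vectors below are invented, and the `contextual` function is a simple attention-style blend standing in for a full Transformer layer, so treat this as an illustration of the idea rather than how any production model computes its representations.

```python
import math

# Hypothetical static embeddings: one fixed vector per word, whatever the sentence.
STATIC = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "loan": [0.0, 1.0],
}


def contextual(sentence, position):
    """Blend the static vectors of a sentence using softmax attention weights,
    with the word at `position` as the query. A toy stand-in for a Transformer layer."""
    vectors = [STATIC[word] for word in sentence]
    query = vectors[position]
    scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]


sentence_a = ["river", "bank"]
sentence_b = ["bank", "loan"]

static_a = STATIC["bank"]
static_b = STATIC["bank"]
context_a = contextual(sentence_a, 1)  # "bank" next to "river"
context_b = contextual(sentence_b, 0)  # "bank" next to "loan"

print(static_a == static_b)   # the lookup table cannot tell the two uses apart
print(context_a, context_b)   # the blended vectors differ between the two sentences
```

The static lookup returns the same vector for “bank” in both sentences, while the blended version shifts toward whichever neighbor appears alongside it.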
The role of transfer learning
Transfer learning is where these models took a real turn into human-like performance. Transfer learning is a technique where a model developed for one task is reused (or “transferred”) as the starting point for a model on a second task. It leverages the knowledge gained from the initial task to improve learning in the new task. Sound familiar? That’s how humans learn to do new things, too.
In previous artificial intelligence iterations, AI would need to begin again from scratch every time it learned a new task. In contrast, human children learn to hold something (a bottle, maybe, or a rattle) and then transfer that knowledge to hold other things. It’s this ease researchers wanted to replicate in things like large language models.
Transfer learning isn’t inherently a part of the architecture but a training strategy. However, deep learning architectures, especially large neural networks, have made transfer learning particularly effective. For instance, models like BERT are pre-trained on a massive corpus to learn general language understanding and can then be fine-tuned on smaller, task-specific datasets.
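The pre-train-then-adapt pattern can be sketched on a very small scale. In the example below, a “pre-trained” layer is kept frozen and only a small task-specific head is trained on a handful of labeled examples. This is one simple variant of transfer learning (feature extraction); fine-tuning a model like BERT typically updates the pre-trained weights as well. The weights, dataset, and hyperparameters are all hypothetical.

```python
import math

# Hypothetical frozen "pre-trained" layer: maps a 3-number input to 2 features.
# In a real workflow these weights would come from large-scale pre-training.
PRETRAINED_W = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]


def features(x):
    """Frozen feature extractor: reused as-is, never updated below."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]


def predict(head, x):
    """Task-specific head: logistic regression on top of the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features(x)))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(data, epochs=200, lr=0.5):
    """Train only the small head with gradient descent on the task-specific data."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, label in data:
            error = predict(head, x) - label  # gradient of the log loss w.r.t. z
            feats = features(x)
            head["weights"] = [w - lr * error * f for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head


# Tiny hypothetical task-specific dataset: (input, label).
task_data = [
    ([1.0, 0.0, 0.0], 1),
    ([0.9, 0.1, 0.0], 1),
    ([0.0, 0.0, 1.0], 0),
    ([0.0, 0.1, 0.9], 0),
]

head = fine_tune(task_data)
print([round(predict(head, x), 2) for x, _ in task_data])
```

Only three numbers are learned here (two head weights and a bias), which is why a few examples are enough. The heavy lifting is assumed to have happened when the frozen layer was pre-trained.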
This approach has become standard in many NLP tasks because training large models from scratch is computationally expensive and may require data resources that are not always available. Now that these models are capable of transfer learning, we’re getting domain-specific models.
What are some examples of large language models?
Here are a few examples of LLMs making news right now and some that continue to transform how we approach natural language processing.
OpenAI’s GPT-4 was unveiled in March of 2023 and astonished just about everyone. It handles complex reasoning tasks well and has demonstrated potential for complex coding (albeit with some controversy). It’s the first model in OpenAI’s GPT series to accept images as well as text. ChatGPT is the most visible product built on GPT-4, although without the multimodal capabilities. However, Bing Chat has rolled out this capability for select users.
And obviously, if we’re talking about GPT-4, we need to give nods to previous versions. OpenAI first released GPT-3 in 2020, and GPT-3.5 powers the current version of ChatGPT for most users.
Language Model for Dialogue Applications (LaMDA) is a family of LLMs developed by Google. These are decoder-only Transformer language models. Google pre-trained LaMDA on a large corpus of text, but many people may remember it for more sensational reasons. A former Google engineer went public claiming that the program wasn’t just human-like, but actually sentient.
Bidirectional Encoder Representations from Transformers (BERT) is another Google model family, built on the encoder half of the Transformer. It’s famous for being bidirectional: when it builds a representation of a word, it draws on the context to both the left and the right of that word at once. This gives BERT a deeper understanding of the meaning of words in sentences and how they relate to each other, something handy in sentiment analysis. You might know it best from a 2019 update to Google search capabilities.
Large Language Model Meta AI (LLaMA) was trained on a variety of public data sources, and its largest version has 65 billion parameters. Although originally released only to approved researchers and developers, its weights were leaked and now circulate widely in the open source community.
Developed by Microsoft, Orca is relatively small at 13 billion parameters, small enough to (theoretically) run on something like a laptop. It’s built on top of a smaller version of LLaMA but learns by imitating much larger models like GPT-4. It’s open source and currently making waves in the research world.
Pathways Language Model (PaLM) 2 is the next generation of PaLM, an LLM designed to generalize across domains and tasks. The original PaLM had 540 billion parameters; Google has not published an official parameter count for PaLM 2. It shows more promise in understanding traditionally tricky language tasks like riddles, idioms, and other nuanced texts from multiple languages. It powers Google’s Bard, but users can also test it on Google’s Vertex AI platform.
phi-1 is a Microsoft LLM making the news for its miniature size of 1.3 billion parameters. It was trained in just four days on a collection of textbook-quality data, a testament to the power of truly quality data plus synthetic data. It has fewer general capabilities but signals a trend towards LLMs scaling down.
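To see why parameter counts matter for where a model can live, a rough back-of-the-envelope calculation helps. The sketch below estimates only the memory needed to hold the weights, using the parameter counts quoted above. It ignores everything else a running model needs (activations, caches, and optimizer state during training), and the 16-bit and 4-bit storage sizes are common choices assumed here for illustration.

```python
def weight_memory_gb(parameters, bytes_per_parameter):
    """Approximate memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


# Parameter counts as quoted in the text above.
models = {
    "LLaMA (largest)": 65e9,
    "Orca": 13e9,
    "phi-1": 1.3e9,
}

for name, parameters in models.items():
    fp16 = weight_memory_gb(parameters, 2)    # 16-bit floats: 2 bytes per parameter
    int4 = weight_memory_gb(parameters, 0.5)  # 4-bit quantization: half a byte per parameter
    print(f"{name}: {fp16:.1f} GB at 16-bit, {int4:.2f} GB at 4-bit")
```

By this estimate, a 13-billion-parameter model needs about 26 GB for its weights at 16-bit precision, which is why smaller models and lower-precision storage are what bring LLMs within reach of consumer hardware.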
BLOOM is a massive, open source model trained on 46 natural languages and 13 programming languages. The project brought together teams from Hugging Face, NVIDIA, Microsoft, and others, with over 1000 researchers contributing, for the purpose of making an open source resource.
The road ahead for LLMs
These are far from the only LLMs out there, and we’ll continue to see more. Large language models are demonstrating a profound ability to grasp human language and generate content. We’ve already taken huge leaps in artificial intelligence and brought AI capabilities to the masses; we’ll continue to see iterations of LLMs niche down, become more efficient, and integrate with more of our everyday tasks. However, the continued questions of ethical usage and understanding limitations will define the years ahead.
Elizabeth Wallace is a Nashville-based freelance writer with a soft spot for data science and AI and a background in linguistics. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain – clearly – what it is they do.