More than huggingface imports — LLM Internals
A comprehensive guide to understanding the evolution of LLMs and the SOTA architectures with implementation details.
The post provides a comprehensive overview of Large Language Models (LLMs), covering various aspects such as architectural innovations, training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. It aims to give a concise yet thorough review of recent developments in LLM research, discussing relevant background concepts and advanced topics at the frontier of LLM research. It is intended as a systematic survey and a comprehensive reference for researchers and practitioners in the field, summarizing major findings and discussing key design and development aspects of LLMs.
Evolutionary Timeline
The progress in natural language processing (NLP) evolved from statistical to neural language modeling, and then from pre-trained language models (PLMs) to large language models (LLMs). While conventional language modeling trains task-specific models, PLMs are trained in a self-supervised setting on a large corpus to learn generic representations shareable across NLP tasks. LLMs scale up model parameters and training data significantly, leading to better performance. Numerous LLMs have been proposed, and the pace of new releases keeps increasing. Early LLMs relied on transfer learning, until GPT-3 showed that LLMs can transfer to new tasks zero-shot, without fine-tuning. However, LLMs still fail to follow user intent in zero-shot settings. Fine-tuning and aligning LLMs with human preferences enhances generalization and reduces misaligned behavior. Additionally, LLMs appear to have emergent abilities, such as reasoning and planning, acquired from their gigantic scale. Such abilities have enabled diverse applications including multi-modal understanding, robotics, and question answering. Further improvements have been suggested via task-specific training or better prompting. However, LLMs' extensive compute and memory requirements have limited adoption, motivating more efficient architectures and training strategies such as parameter-efficient tuning, pruning, quantization, and knowledge distillation to enable wider utilization.
What to expect in this LLM ride?
This article provides a comprehensive overview of the research and developments in large language models (LLMs). It summarizes the architectural and training details of pre-trained LLMs and explores concepts like fine-tuning, multi-modal LLMs, robotics applications, datasets, and evaluation. The key contributions include a concise yet thorough survey of LLM research aimed at giving direction, extensive summaries of major pre-trained models with architectural and training specifics, a discussion of key innovations in chronological order highlighting major findings, and coverage of concepts to understand LLMs including background, pre-training, fine-tuning, robotics, multi-modal LLMs, augmented LLMs, datasets, and evaluation. Topics discussed include LLM overview, architectures, training pipelines, utilization, key learnings from each model, crucial configuration details, training and evaluation, datasets and benchmarks, challenges, and future directions. The goal is to provide a self-contained overview to help practitioners effectively leverage LLMs.
LLMs Development Phases
A broader overview of LLMs divides the field into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4. Inference 5. Evaluation 6. Applications 7. Challenges
Large language model development can be divided into the following phases; several of these are already established methodologies in the broader machine learning ecosystem.
1. Pre-Training
Pre-training is the foundational phase in the development of Large Language Models (LLMs) where the model is exposed to vast amounts of text data. This process allows the model to learn the intricacies of language, including grammar, syntax, semantics, and even some aspects of common sense and world knowledge. Pre-training is computationally intensive and requires significant resources, often involving training on datasets that span billions of words. The objective is to create a model that has a broad understanding of language, capable of generating coherent text, understanding context, and making predictions about unseen text. The effectiveness of the pre-training phase is crucial for the performance of LLMs in downstream tasks.
2. Fine-Tuning
After pre-training, LLMs undergo a fine-tuning process where the model is further trained on a smaller, task-specific dataset. This step adjusts the model’s parameters to perform well on specific tasks such as sentiment analysis, question-answering, or document summarization. Fine-tuning allows the general capabilities learned during pre-training to be adapted to the nuances and specific requirements of a particular task. This phase is critical for achieving high performance in real-world applications, as it bridges the gap between the model’s general language understanding and the specific demands of a task.
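To make the mechanics of this phase concrete, here is a minimal fine-tuning sketch. It assumes the Hugging Face transformers and datasets libraries; the checkpoint (distilbert-base-uncased), the IMDB dataset, and the hyperparameters are illustrative placeholders rather than recommendations.

```python
# Illustrative fine-tuning sketch: adapt a pre-trained checkpoint to a sentiment task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                      # placeholder task-specific dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    # Convert raw text into fixed-length token IDs for the model
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetune-out",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)

# Fine-tune on a small subset just to illustrate the mechanics
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)))
trainer.train()
```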
3. Efficient Inference
Efficient inference is about optimizing the model to make predictions more quickly and using fewer computational resources. This is crucial for deploying LLMs in real-world applications where response time and computational costs are significant considerations. Techniques such as model pruning, quantization, and knowledge distillation are used to reduce the size of the model and speed up its inference times without significantly compromising performance. Efficient inference is a key area of research aimed at making LLMs more accessible and practical for everyday use.
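As a small illustration of one such technique, here is a sketch of post-training dynamic quantization with PyTorch. The model is a stand-in toy network; real LLM deployments typically rely on more specialized 8-bit or 4-bit weight quantization toolchains.

```python
# Illustrative: shrink Linear-layer weights to int8 for faster CPU inference.
import torch
from torch import nn

# Toy stand-in model; in practice this would be a pre-trained transformer
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Replace Linear layers with dynamically quantized int8 versions
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface and outputs shape, smaller weights
```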
4. Evaluation
Evaluating LLMs involves assessing their performance across a range of metrics and tasks to understand their capabilities and limitations. This can include measuring the accuracy, fluency, and coherence of generated text, as well as task-specific metrics like F1 scores for question-answering tasks. Evaluation is challenging due to the subjective nature of language and the broad range of potential applications. It often involves both automated metrics and human judgment. Effective evaluation is crucial for guiding improvements to LLMs and for understanding their potential impact on various applications.
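For task-specific metrics such as F1, a tiny sketch using the Hugging Face evaluate library (the predictions and references below are dummy values):

```python
# Illustrative: computing F1 for a toy classification output
import evaluate

f1 = evaluate.load("f1")
predictions = [0, 1, 1, 0, 1]
references  = [0, 1, 0, 0, 1]
print(f1.compute(predictions=predictions, references=references))  # {'f1': ...}
```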
5. Applications
LLMs have a wide range of applications, from generating text in chatbots and virtual assistants to aiding in content creation, summarization, and translation. They are also used in more specialized fields like legal analysis, medical information processing, and educational tools. The versatility of LLMs stems from their broad understanding of language, allowing them to be adapted to many different tasks. As these models continue to improve, their potential applications expand, offering significant opportunities to automate and enhance various language-related tasks.
6. Challenges
Despite their capabilities, LLMs face several challenges. These include biases in the training data that can lead to biased or unfair outcomes, difficulties in ensuring the factual accuracy of generated content, and the environmental impact of training large models. There are also concerns about misuse, such as generating misleading information or impersonating individuals. Addressing these challenges is critical for the responsible development and deployment of LLMs.
Core Elements of LLM System
Tokenization
The tokenization step in building Large Language Models (LLMs) is a critical preprocessing phase where raw text is converted into a format that can be processed by the model. Here’s a detailed explanation:
Understanding Tokenization
Tokenization is the process of breaking down text into smaller units, called tokens. These tokens can be words, subwords, or even characters, depending on the granularity required by the model. The choice of tokenization method can significantly impact the model’s ability to understand and generate language.
Purpose of Tokenization in LLMs
- Standardizing Input: Tokenization standardizes the input text, ensuring that the model receives data in a consistent format. This standardization is crucial for the model to learn patterns in the text.
- Handling Vocabulary: LLMs have a fixed vocabulary size, and tokenization helps manage this by breaking down words into pieces that are in the model’s vocabulary. This is especially important for handling rare or out-of-vocabulary words.
- Improving Efficiency: By breaking text into smaller pieces, tokenization can improve the model’s efficiency, both in terms of memory usage and processing speed. Smaller tokens mean the model can process more complex texts with fewer resources.
- Enhancing Language Understanding: Tokenization strategies like subword tokenization (e.g., Byte-Pair Encoding or BPE) allow the model to understand and generate language more flexibly. It helps the model grasp the meaning of unknown words based on known subwords, improving its ability to handle diverse languages and linguistic phenomena.
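To see the subword behavior from the last point in action, here is a small illustrative snippet using the pretrained GPT-2 BPE tokenizer from the Hugging Face transformers library:

```python
# Illustrative: a pretrained BPE tokenizer keeps common words whole and
# splits rare words into known subword pieces.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("the"))                      # common word -> a single token
print(tok.tokenize("electroencephalography"))   # rare word -> several subword pieces
```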
Tokenization Methods
- Word-Level Tokenization: This method splits text into words. It’s straightforward but can struggle with languages that don’t use spaces or have a high number of out-of-vocabulary words.
- Subword Tokenization: Methods like BPE, SentencePiece, or WordPiece split words into smaller meaningful units. This approach balances the granularity between character-level and word-level tokenization, allowing the model to handle a wide range of languages and neologisms.
- Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is a tokenization method that starts with a basic vocabulary of individual characters and iteratively merges the most frequently occurring adjacent pairs of tokens to form new tokens. This process is repeated for a fixed number of iterations or until a desired vocabulary size is reached. BPE strikes an efficient balance between the granularity of tokenization and the size of the vocabulary: common words are kept whole, while less common words are broken down into smaller subwords, helping the model handle rare words without drastically increasing the vocabulary size.
- SentencePiece
SentencePiece is a tokenization library that treats the input text as a raw input stream, which allows it to learn subword units (like BPE) directly from the raw text without the need for pre-tokenization into words. This approach is particularly useful for languages that do not use whitespace to separate words. SentencePiece supports both BPE and unigram language model-based tokenization. By treating the text as a raw character stream, SentencePiece keeps tokenization largely language-independent, making it versatile and effective for multilingual models.
- WordPiece
WordPiece is similar to BPE in that it starts with a base vocabulary of characters and incrementally adds the most frequent combinations of tokens to the vocabulary. The difference lies in the criterion for choosing new tokens: rather than merging by raw pair frequency, WordPiece selects merges that maximize the likelihood of the training data under the current vocabulary. This allows efficient handling of a large vocabulary by breaking down rare or unknown words into known subwords, improving the model's ability to understand and generate text.
Each of these subword methods has its advantages and is chosen based on the requirements of the task and the characteristics of the language or languages being modeled. BPE and WordPiece are particularly popular in models like GPT and BERT respectively, while SentencePiece offers a flexible solution for multilingual models and languages with complex word boundaries.
- Character-Level Tokenization: Here, text is broken down into individual characters. While it ensures that there are no out-of-vocabulary tokens, it requires the model to process much longer input sequences, which can be less efficient.
Impact on Model Performance
The choice of tokenization impacts the model’s performance, understanding, and generality. For instance, subword tokenization has become popular in LLMs because it offers a good balance by enabling the model to handle unknown words better, adapt to different languages, and reduce the size of the input sequences without significantly compromising the model’s ability to understand context and semantics.
Here is the link to the experimentation notebook that explains the training and usage of different tokenizers.
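For a quick standalone taste, here is a minimal sketch of training a small BPE tokenizer with the Hugging Face tokenizers library; the toy corpus and vocabulary size are illustrative only.

```python
# Illustrative: train a tiny BPE tokenizer from scratch on a toy corpus.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "Large language models learn statistical patterns of text.",
    "Tokenization splits text into subword units.",
    "Byte pair encoding merges frequent character pairs.",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("language models tokenize text")
print(encoding.tokens)   # common words stay whole, rarer ones split into subwords
```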
Encoding Positions
(The linked source has a nice visualization of the positional encodings.)
Positional embeddings in transformer architecture are vectors added to the input embeddings of tokens to encode their position or order within a sequence. Unlike recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers do not inherently understand the order of tokens in a sequence. Therefore, positional embeddings are crucial for transformers to capture sequential information.
In transformers, positional embeddings are typically generated using mathematical functions such as sine and cosine functions of different frequencies and phases. These embeddings are added to the token embeddings before being fed into the transformer layers. By incorporating positional embeddings, transformers can distinguish between tokens based on their positions in the sequence, enabling them to capture sequential dependencies effectively.
The use of positional embeddings allows transformers to process input sequences efficiently and perform tasks such as language translation, text generation, and sequence classification with high accuracy, making them a fundamental component of transformer architecture.
- Sinusoidal Positional Embeddings: The sinusoidal positional embeddings are calculated using a combination of sine and cosine functions, which are periodically repeating functions. The intuition behind using these functions is that they can represent the relative position of tokens in a sequence effectively.
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Here, pos is the position index, i is the dimension-pair index, and d_model is the model dimension. The sine and cosine functions are computed at different frequencies (controlled by the 10000^(2i/d_model) term) across dimensions, so each dimension encodes a different pattern of relative position.
- ALiBi (Attention with Linear Biases): ALiBi does not add positional vectors to the token embeddings at all. Instead, it biases the attention scores directly: a fixed, non-learned penalty proportional to the distance between the query and key positions is added to each attention logit, with a different slope for each attention head. Because nearby tokens are penalized less than distant ones, the model keeps a recency bias, and the scheme extrapolates well to sequence lengths longer than those seen during training.
- score(q_i, k_j) = q_i · k_j + m * (j - i), for j ≤ i
- Here, i and j are the query and key positions and m is a head-specific slope, typically drawn from a geometric sequence such as 1/2, 1/4, ..., so that different heads attend over different effective context lengths.
- RoPE (Rotary Position Embedding): RoPE encodes position by rotating the query and key vectors instead of adding a separate embedding. Each pair of dimensions is treated as coordinates in a 2D plane and rotated by an angle proportional to the token's position. Because the dot product of two rotated vectors depends only on the difference of their rotation angles, the attention score between a query and a key naturally encodes their relative position.
- q_rot = (q1 * cos(m*θ) - q2 * sin(m*θ), q1 * sin(m*θ) + q2 * cos(m*θ))
- k_rot = (k1 * cos(n*θ) - k2 * sin(n*θ), k1 * sin(n*θ) + k2 * cos(n*θ))
- Here, (q1, q2) and (k1, k2) are one pair of dimensions of the query and key vectors, m and n are their token positions, and θ is a fixed frequency for that dimension pair (following the same 10000^(-2i/d_model) schedule as the sinusoidal embeddings). The inner product q_rot · k_rot then depends only on the relative offset m - n, which helps the model capture relative and long-range dependencies. A minimal code sketch of all three schemes follows this list.
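Here is a minimal, self-contained NumPy sketch of the three schemes discussed above (sinusoidal, ALiBi, RoPE). It is an illustrative re-implementation based on the definitions given in this list, not code from any particular model, and the sizes in the usage example are arbitrary.

```python
# Illustrative re-implementations of sinusoidal PE, ALiBi biases, and RoPE rotation.
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sin/cos positional encodings, shape (seq_len, d_model); assumes even d_model."""
    pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))              # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                             # even dims: sine
    pe[:, 1::2] = np.cos(angle)                             # odd dims: cosine
    return pe

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """ALiBi: per-head linear distance penalties added to attention logits,
    shape (n_heads, seq_len, seq_len)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)        # geometric slopes
    dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]    # j - i
    dist = np.minimum(dist, 0)   # zero out future positions (masked anyway in causal attention)
    return slopes[:, None, None] * dist[None, :, :]

def rope_rotate(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """RoPE: rotate each (even, odd) dimension pair of x by a position-dependent angle.
    x has shape (seq_len, d_head); returns the rotated vectors."""
    d = x.shape[-1]
    theta = 10000 ** (-np.arange(0, d, 2) / d)              # per-pair frequencies
    angle = positions[:, None] * theta[None, :]             # (seq_len, d/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[:, 1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

# Example usage with toy sizes
print(sinusoidal_pe(4, 8).shape)            # (4, 8)
print(alibi_bias(4, 2).shape)               # (2, 4, 4)
q = np.random.randn(4, 8)
print(rope_rotate(q, np.arange(4)).shape)   # (4, 8)
```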
Comparison of Positional Embedding Approaches in Transformer Models
Further interesting positional encoding algorithms to study:
- Convolutional Positional Embeddings: Proposed by Gehring et al. (2017), this approach uses convolutional kernels to generate positional embeddings. The intuition is that convolutions can capture local patterns, which can be useful for representing positional information.
- Transformer Product Representations: Introduced by Viswanath et al. (2022), this method represents positional information as a product of two embeddings: a global embedding that captures the entire sequence and a local embedding that captures the local context around each position.
- Fourier Positional Embeddings: Proposed by Tay et al. (2021), this approach uses Fourier features to encode positional information. It is based on the idea that Fourier features can effectively represent periodic signals, which can be useful for capturing positional patterns.
- Relative Positional Encoding: This approach, used in models like Transformer-XL and XLNet, represents the relative position between each pair of tokens using learned embeddings, similar to RoPE but without the rotation operation.
- Untied Positional Embeddings: Introduced by Raffel et al. (2020), this method learns separate positional embeddings for each attention head, rather than using a shared set of embeddings for all heads.
- Factorized Positional Embeddings: Proposed by Huang et al. (2020), this approach factorizes the positional embeddings into two components: a temporal embedding and a spatial embedding, which can be useful for tasks involving both time and space dimensions.
Positional Encoding Methods Used in Top Large Language Models
Ok, I am hitting the ceiling on word count here, so let's pick up the rest in the next part. Either way, I was planning to split this into multiple chunks.
Reference to the experimentation code for the above blog