Lecture 1: Overview and Tokenization
Introduction to CS336 and fundamentals of tokenization for language modeling
BPE tokenizers
Byte-pair encoding (BPE) has two phases: training and tokenization (Sennrich et al., 2016). Essentially, it repeatedly finds the most common pair of adjacent characters (or bytes) in the training data and adds the merged pair to the vocabulary, until the desired vocabulary size is reached.
It's a greedy algorithm, so it isn't guaranteed to achieve maximum compression; the point is rather to strike a balance, keeping sequence lengths manageable while also keeping the vocabulary size from growing too large.
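To make the training loop concrete, here is a minimal sketch in Python. It is illustrative only, not the course's reference implementation: the function name `train_bpe`, the representation of the corpus as a list of byte strings, and the choice to return merges as a dict are all assumptions made for this sketch.

```python
from collections import Counter

def train_bpe(corpus: list[bytes], num_merges: int) -> dict[tuple[int, int], int]:
    """Greedy BPE training sketch (illustrative, not a reference implementation).

    Token IDs 0-255 are the raw bytes; each merge introduces a new ID.
    Returns a dict mapping a pair of token IDs to the merged token ID,
    in the order the merges were learned.
    """
    sequences = [list(text) for text in corpus]  # byte values are the initial token IDs
    merges: dict[tuple[int, int], int] = {}
    next_id = 256

    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break

        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges[best] = next_id

        # Replace every occurrence of the best pair with the new token ID.
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(next_id)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
        next_id += 1

    return merges
```

For example, `train_bpe([b"low low lower"], num_merges=5)` would first merge frequent byte pairs such as (`l`, `o`), then build longer tokens out of the merged IDs.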
This approach has been widely adopted in modern language models (Brown et al., 2020; Radford et al., 2019), and has been further refined with techniques like subword regularization (Kudo, 2018).
Transformer Architecture
While tokenization is crucial, the transformer architecture (Vaswani et al., 2017) provides the foundation for modern language models. The attention mechanism allows models to focus on relevant parts of the input sequence.
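As a rough illustration of the mechanism the paragraph above refers to, scaled dot-product attention from Vaswani et al. (2017) can be written in a few lines of NumPy. The function name and shapes below are assumptions for this sketch, not the lecture's notation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (num_queries, d_k) and (num_keys, d_k);
    V: array of shape (num_keys, d_v).
    Each output row is a weighted average of V, with weights given by how
    strongly that query matches each key.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V
```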
The tokenization process
The tokenization process (after training) is as follows:
- We first normalize the text (e.g. lowercase, strip punctuation) using the same normalization we applied in the training phase.
- Split the text into its initial tokens using the byte-level "unicode" split: every Unicode character is encoded as one or more bytes (e.g. via UTF-8), and those bytes are the starting tokens.
- We then keep performing the merge step, replacing adjacent pairs with their merged token, until no adjacent pair is present in the vocabulary; anything that still cannot be represented would fall back to an [UNK] symbol (byte-level BPE largely avoids this, since every byte is already in the vocabulary). A minimal sketch of this encode step follows this list.
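The sketch below reuses the hypothetical `merges` mapping from the training sketch above. It is simplified: it applies merges over the whole sequence in the order they were learned, and it skips normalization, pre-tokenization, and special tokens that production tokenizers handle.

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned BPE merges to a new string (illustrative sketch)."""
    # (Any normalization used during training would be applied before this step.)
    tokens = list(text.encode("utf-8"))  # byte-level "unicode" split: token IDs 0-255

    # Apply merges in training order; each pass replaces all occurrences of one pair.
    for pair, new_id in merges.items():
        out, j = [], 0
        while j < len(tokens):
            if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        tokens = out
    return tokens
```

For instance, `encode("low lower", merges)` with the merges learned earlier returns a shorter sequence of token IDs than the raw UTF-8 byte sequence.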
In any case, effective tokenization is a prerequisite for transformer-based models (Vaswani et al., 2017) to operate on a sensible input sequence.