Lecture 1: Overview and Tokenization
Introduction to CS336 and fundamentals of tokenization for language modeling
BPE tokenizers
Byte-pair encoding (BPE) has two phases: training and tokenization (Sennrich et al., 2016). Essentially, it repeatedly finds the most common pair of adjacent characters (or bytes) in the training data and adds the merged pair to the vocabulary, until the desired vocabulary size is reached.
It's a greedy algorithm, so it isn't guaranteed to achieve maximum compression; the point is rather to strike a balance, keeping sequence lengths manageable while also keeping the vocabulary size from growing too large.
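To make the training loop concrete, here is a minimal sketch in Python. It is illustrative only, not the course's reference implementation: the function name `train_bpe`, the representation of the corpus as a list of byte strings, and the choice to return merges as a dict are all assumptions made for this sketch.

```python
from collections import Counter

def train_bpe(corpus: list[bytes], num_merges: int) -> dict[tuple[int, int], int]:
    """Greedy BPE training sketch (illustrative, not a reference implementation).

    Token IDs 0-255 are the raw bytes; each merge introduces a new ID.
    Returns a dict mapping a pair of token IDs to the merged token ID,
    in the order the merges were learned.
    """
    sequences = [list(text) for text in corpus]  # byte values are the initial token IDs
    merges: dict[tuple[int, int], int] = {}
    next_id = 256

    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break

        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges[best] = next_id

        # Replace every occurrence of the best pair with the new token ID.
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(next_id)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
        next_id += 1

    return merges
```

For example, `train_bpe([b"low low lower"], num_merges=5)` would first merge frequent byte pairs such as (`l`, `o`), then build longer tokens out of the merged IDs.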
This approach has been widely adopted in modern language models (Brown et al., 2020; Radford et al., 2019), and has been further refined with techniques like subword regularization (Kudo, 2018).
Transformer Architecture
While tokenization is crucial, the transformer architecture (Vaswani et al., 2017) provides the foundation for modern language models. The attention mechanism allows models to focus on relevant parts of the input sequence.
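As a rough illustration of the mechanism the paragraph above refers to, scaled dot-product attention from Vaswani et al. (2017) can be written in a few lines of NumPy. The function name and shapes below are assumptions for this sketch, not the lecture's notation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (num_queries, d_k) and (num_keys, d_k);
    V: array of shape (num_keys, d_v).
    Each output row is a weighted average of V, with weights given by how
    strongly that query matches each key.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V
```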
The tokenization process
The tokenization process (after training) is as follows:
- We first normalize the text (e.g. lowercase, strip punctuation) using the same normalization we applied in the training phase.
- Split the text into its initial tokens using the byte-level "unicode" split: every Unicode character is encoded as one or more bytes (e.g. via UTF-8), and those bytes are the starting tokens.
- We then keep performing the merge step, replacing adjacent pairs with their merged token, until no adjacent pair is present in the vocabulary; anything that still cannot be represented would fall back to an [UNK] symbol (byte-level BPE largely avoids this, since every byte is already in the vocabulary). A minimal sketch of this encode step follows this list.
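The sketch below reuses the hypothetical `merges` mapping from the training sketch above. It is simplified: it applies merges over the whole sequence in the order they were learned, and it skips normalization, pre-tokenization, and special tokens that production tokenizers handle.

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned BPE merges to a new string (illustrative sketch)."""
    # (Any normalization used during training would be applied before this step.)
    tokens = list(text.encode("utf-8"))  # byte-level "unicode" split: token IDs 0-255

    # Apply merges in training order; each pass replaces all occurrences of one pair.
    for pair, new_id in merges.items():
        out, j = [], 0
        while j < len(tokens):
            if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        tokens = out
    return tokens
```

For instance, `encode("low lower", merges)` with the merges learned earlier returns a shorter sequence of token IDs than the raw UTF-8 byte sequence.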
In any case, effective tokenization is a prerequisite for transformer-based models (Vaswani et al., 2017) to operate on a sensible input sequence.