How Tokenization is Limiting Generative AI Models

Understanding Tokenization in Generative AI Models

How Generative AI Models Process Text

Generative AI models do not process text the way humans do. Understanding their token-based internals helps explain some of their peculiar behaviors and limitations. Most models, from small on-device ones like Gemma to industry leaders like OpenAI’s GPT-4, are built on an architecture called the transformer. Transformers learn associations between text and other kinds of data, but they cannot take in or output raw text without an enormous amount of compute.


The Role of Tokenization

Transformers work with text by breaking it into smaller pieces called tokens, a process known as tokenization. A token can be an entire word, a syllable, or a single character. For example, the word “fantastic” might be tokenized as “fan,” “tas,” and “tic.” Tokenization lets transformers pack more semantic information into a fixed context window, but it can also introduce biases.
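To make this concrete, here is a minimal sketch using the open-source tiktoken library (an assumption; the article does not name a specific tokenizer). It prints the subword pieces a BPE tokenizer actually produces; the exact split of any given word depends on the tokenizer’s vocabulary, so the “fan/tas/tic” split above is illustrative.

```python
# Minimal tokenization sketch using tiktoken (assumed; not named in the article).
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("fantastic")
print(token_ids)                             # integer token IDs
print([enc.decode([t]) for t in token_ids])  # the subword pieces they map to
```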


Tokenization Challenges and Biases

Tokenization can introduce odd spacing issues that confuse transformers. For instance, “once upon a time” may be tokenized differently from “once upon a time ” with trailing whitespace. Tokenizers also handle capitalization inconsistently: “Hello” may be a single token while “HELLO” is split into several. These inconsistencies can affect model performance, especially in tasks requiring precise text interpretation.
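The sketch below, again assuming tiktoken’s cl100k_base encoding as a stand-in for any BPE-style tokenizer, makes these inconsistencies visible by printing how each variant is split:

```python
# How whitespace and casing change tokenization (tiktoken assumed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["once upon a time", "once upon a time ", "Hello", "HELLO"]:
    ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```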


Language-Specific Tokenization Issues

Tokenization is particularly challenging for non-English languages. Many tokenization methods assume spaces separate words, which is not true for languages like Chinese, Japanese, Korean, Thai, or Khmer. Studies show that tasks in non-English languages can take twice as long for transformers to complete, and users of less token-efficient languages may experience poorer performance and higher costs.
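A rough way to see this token-efficiency gap, again assuming tiktoken’s cl100k_base encoding, is to compare token counts for short phrases of similar meaning (the sample sentences below are illustrative):

```python
# Comparing token efficiency across scripts (tiktoken assumed; exact counts
# vary by tokenizer, but non-Latin scripts often need more tokens).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
    "Thai": "สวัสดี สบายดีไหม",
}
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens for {len(text)} characters")
```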


Tokenization and Mathematical Challenges

Generative AI models also struggle with math due to inconsistent tokenization of digits. Models might treat “380” as one token but split “381” into two tokens (“38” and “1”), disrupting numerical relationships. This inconsistency leads to errors in understanding numerical patterns and context.
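Rather than asserting any particular split, the sketch below (tiktoken assumed) simply prints how a given tokenizer actually divides each number; the “380” vs. “381” example from the article is illustrative and tokenizer-dependent:

```python
# Inspecting how digit strings are split into tokens (tiktoken assumed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for number in ["380", "381", "1234567"]:
    pieces = [enc.decode([t]) for t in enc.encode(number)]
    print(f"{number} -> {pieces}")
```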


Potential Solutions and Future Directions

Addressing tokenization’s challenges requires innovation. Byte-level state space models like MambaByte, which bypass tokenization by processing raw bytes, show promise. MambaByte handles text-level noise, such as swapped characters and varying capitalization, better than token-based models. However, these models are still in the early research stages, and feeding raw bytes directly into transformers remains computationally infeasible at scale.
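The byte-level idea itself is simple: instead of a learned subword vocabulary, the model’s input is just the text’s UTF-8 byte sequence, so there is no tokenizer to introduce splitting artifacts. A minimal illustration (plain Python, no model involved):

```python
# Byte-level representation: every character becomes one or more raw
# UTF-8 bytes, with no vocabulary or merge rules in between.
text = "Hello, 世界"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
print(f"{len(byte_ids)} bytes vs {len(text)} characters")
```

The trade-off is sequence length: byte sequences are far longer than token sequences, which is why this approach has so far favored state space models over transformers, whose attention cost grows quadratically with length.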


The Future of Tokenization

While tokenization presents significant challenges for generative AI, advancements in model architectures and computational techniques may offer solutions. Future models may move away from tokenization altogether, allowing for more accurate and efficient text processing.



Source: Tokens are a big reason today’s generative AI falls short | TechCrunch

