- Generative AI models use tokenization to process text, breaking it into smaller units.
- Tokenization introduces biases and challenges, especially in non-English languages.
- Future AI models may use byte-level processing to overcome tokenization limitations.
How Generative AI Models Process Text
Generative AI models do not process text the way humans do. Understanding their token-based internal representation helps explain some of their peculiar behaviors and limitations. Most models, from small on-device ones like Gemma to industry leaders like OpenAI’s GPT-4, are built on an architecture known as the transformer. Transformers learn associations between pieces of text, but ingesting raw, unencoded text directly would require an enormous amount of computing power, so the text is first converted into tokens.

The Role of Tokenization
Transformers work with text by breaking it down into smaller pieces called tokens, a process known as tokenization. Tokens can range from entire words to syllables or individual characters. For example, the word “fantastic” might be tokenized as “fan,” “tas,” and “tic.” Tokenization allows transformers to process more semantic information before reaching their context window limit, but it can also introduce biases.
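To make this concrete, here is a minimal sketch of inspecting how a subword tokenizer splits words. It assumes the open-source tiktoken library purely for illustration; the article does not name a specific tokenizer, and exact splits vary by model.

```python
# Minimal sketch: how a BPE tokenizer splits words into subword tokens.
# Assumes the open-source `tiktoken` library (pip install tiktoken);
# exact splits differ between tokenizers and models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-family models

for word in ["fantastic", "tokenization", "hello"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{word!r} -> {len(token_ids)} tokens: {pieces}")
```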
Tokenization Challenges and Biases
Tokenization can introduce odd spacing artifacts that confuse models. For instance, “once upon a time” may be encoded differently from the same phrase with trailing whitespace. Tokenizers also handle capitalization inconsistently: “Hello” may be a single token, while “HELLO” can be split into several. These inconsistencies can degrade model performance, especially on tasks that require precise interpretation of the text.
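A quick sketch of the whitespace and casing effects described above, again assuming the tiktoken library as a stand-in for whatever tokenizer a given model uses:

```python
# Sketch of whitespace and casing effects on tokenization.
# Hypothetical tiktoken setup; other tokenizers will split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "once upon a time",
    "once upon a time ",  # trailing space changes the split
    "Hello", "HELLO", "hello",  # casing changes token counts
]
for text in samples:
    ids = enc.encode(text)
    print(f"{text!r:24} -> {len(ids)} tokens: {[enc.decode([i]) for i in ids]}")
```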
In his February 2024 lecture “Let’s build the GPT Tokenizer,” Andrej Karpathy put it this way: “Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and…”
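To illustrate Karpathy’s point that tokenizers are trained separately from the model, here is a toy Byte Pair Encoding trainer. It is a simplified sketch of the general idea, not the algorithm used by any particular production tokenizer.

```python
# Toy Byte Pair Encoding (BPE) trainer: repeatedly merge the most frequent
# adjacent pair of ids. Simplified sketch for illustration only.
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules from the raw UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step  # ids 0-255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

if __name__ == "__main__":
    rules = train_bpe("once upon a time, once upon a time", num_merges=5)
    print(rules)  # learned merge rules: (id, id) -> new token id
```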
Language-Specific Tokenization Issues
Tokenization is particularly challenging for non-English languages. Many tokenization methods assume spaces separate words, which is not true for languages like Chinese, Japanese, Korean, Thai, or Khmer. Studies show that tasks in non-English languages can take twice as long for transformers to complete, and users of less token-efficient languages may experience poorer performance and higher costs.
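The disparity is easy to see by counting tokens per character across languages. The sketch below again assumes tiktoken as an example tokenizer; the “twice as long” figure above comes from studies of real models, not from this particular encoding.

```python
# Sketch comparing token counts across languages.
# Hypothetical tiktoken setup; counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "The weather is nice today.",
    "Thai": "วันนี้อากาศดี",
    "Japanese": "今日は天気がいいですね。",
}
for lang, text in sentences.items():
    print(f"{lang:9} {len(enc.encode(text)):3d} tokens for {len(text)} characters")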
Tokenization and Mathematical Challenges
Generative AI models also struggle with math due to inconsistent tokenization of digits. Models might treat “380” as one token but split “381” into two tokens (“38” and “1”), disrupting numerical relationships. This inconsistency leads to errors in understanding numerical patterns and context.
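You can observe this directly by printing how digit strings split. Whether “380” and “381” actually split differently depends on the tokenizer; the tiktoken setup below is just an illustrative assumption.

```python
# Sketch of how digit strings split into tokens.
# Hypothetical tiktoken setup; splits depend on the specific tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ["380", "381", "1234567"]:
    ids = enc.encode(number)
    print(f"{number!r} -> {[enc.decode([i]) for i in ids]}")
```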
Potential Solutions and Future Directions
Addressing tokenization challenges requires architectural innovation. Byte-level state space models such as MambaByte, which bypass tokenization by working directly on raw bytes, show promise: they cope far better with noise from swapped characters, odd spacing, and varied capitalization. However, MambaByte and similar models are still at an early research stage, and processing raw bytes remains computationally infeasible for large-scale transformers because byte sequences are much longer than token sequences.
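The cost argument comes down to sequence length: transformer attention grows with the length of the input, and byte sequences are several times longer than token sequences. A small sketch of that ratio, using the same hypothetical tiktoken setup for the token-level count:

```python
# Sketch contrasting byte-level and token-level sequence lengths, which is
# why raw-byte transformers are costly. Hypothetical tiktoken setup.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Generative AI models process text as tokens rather than raw bytes."
num_bytes = len(text.encode("utf-8"))
num_tokens = len(enc.encode(text))
print(f"{num_bytes} bytes vs {num_tokens} tokens "
      f"(~{num_bytes / num_tokens:.1f}x longer sequence at byte level)")
```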
The Future of Tokenization
While tokenization presents significant challenges for generative AI, advancements in model architectures and computational techniques may offer solutions. Future models may move away from tokenization altogether, allowing for more accurate and efficient text processing.
Source: Tokens are a big reason today’s generative AI falls short | TechCrunch