Secret Language of LLMs- Hidden, Special, and Glitch Tokens

Feb 11, 2025

On Special Language for AIs

Anthropomorphizing AI models, especially Large Language Models (LLMs), is a precarious activity that comes with some prerequisite baggage. Some find extreme value in talking about language models as human-like AIs, capable of using words and ideas with intentions and/or motives. Some claim they showcase qualia, whereas others claim that these AIs have the potential to be sentient.

On the other end, some dislike the idea of associating human-like behaviour with LLMs, and instead think of LLMs as stochastic parrots mimicking human-generated words (tokens) from their learning data (sometimes AI-generated words). They also find some value in comparing human and AI behaviour, but more so in a metaphorical sense rather than a utilitarian one. Similar to how the anthropomorphic metaphors for the biological neurons heavily influenced the intuitions for the artificial neural network architecture in deep learning. Even though experts from the field of neuroscience urge caution while making these comparisons, as stated in an MIT study titled “Study urges caution when comparing neural networks to the brain”, there’s some value in making those comparisons.

Irrespective of whether one finds these human-like behaviours as deep signs of intelligence, or just statistical tendencies- the value such comparisons provide is definitely worth looking further into.

In this article, I discuss one such tendency of these language models: the world of hidden tokens. The structure of this article has 3 sections:

Tokens and Hidden Tokens: What are tokens, hidden tokens, and their purpose
Strength of Hidden Tokens: How some hidden tokens are more useful than others
Glitch Tokens: Unintended use cases of special hidden tokens, and their vulnerabilities

Tokens and Hidden Tokens

What are tokens?

Tokens are just part of words that a Large Language Model (LLM) is trained on. Think of a dictionary of keywords that the model has some definition for while making predictions. This process of generating these tokens is called ‘tokenization’, which is often followed by using deep learning to learn meaningful, semantic representations of these tokens. The world of tokenization is a widely studied topic in the field of Natural Language Processing (NLP) that predates the invention of ChatGPT-like Language Model AIs.

Okay, but what are “hidden tokens”?

Majority of the tokens in an LLM's dictionary (called tokenizer vocabulary) come directly from the training data. It could be a word, a punctuation, a domain-specific technical term, etc that the model saw in some document, webpage, etc during its training stage. Hidden tokens are those tokens that do not have any given human context in the raw data for training.

Why the need for these hidden tokens?

Important question. One would assume that all the tokens that an AI would need to learn would be present in their training data. In an ideal world, that would be the case. For practical reasons, and due to the nature of tokenizers and LLMs, some extra tokens are added to the vectorizer dictionary for special purposes. For eg. what happens when the AI sees a token/word that is out of its dictionary (machine learning lingo for that is out-of-vocabulary tokens)? For such cases, we include an unknown token or <UNK>.

Hidden tokens are also used to provide contextual structure to the input data. There are multiple such special tokens for the beginning of text <|begin_of_text|>, end of text <|eot_id|>, beginning of sequence <BOS>, end of sequence <EOS>, and many more. For more such hidden/special tokens, check out this sample list of special tokens for this open-source Hugging Face transformer.

The official OpenAI tokenizer library is called tiktoken and the tokenizer is called Tiktokenizer.

GPT-4o Tokenizer encoding text to hidden tokens

Above is a screenshot of the Tiktokenizer converting the input text into separate tokens for their GPT-4o model, which shows the GPT-4o model’s tokenizer encoding hidden tokens like <|im_start|>, <|im_sep|>, <|im_end|>. To try it out for yourself, check out their github repo and this Vercel app.

Strength of Hidden Tokens

Having the ability to handle unknown tokens and giving structure to input text for these language models are definitely important, but these hidden tokens are truly capable of enhancing performance when used effectively.

The world of useful hidden tokens is vast and might require a separate article altogether. In this article, I focus on one particular family of hidden tokens that has made a significant impact on enhancing an LLM's performance: Thinking Tokens.

Pause and Thinking Tokens

In an ICLR 2024 paper from Carnegie Mellon University and Google by Sachin Goyal et al. titled "Think Before You Speak: Training Language Models with Pause Tokens", the authors suggest the inclusion of <pause> tokens before the LLM generates the final token. The paper presents a simple and intuitive premise: Humans don’t respond to questions immediately; we often have to pause and think about the question before giving a final answer.

Figure from the “Think Before You Speak” Paper showcasing <pause> token architecture

Let’s say we ask a human to give the answer for “25 times 64 is”. A human might think of 25 as 100/4, and then think that 64 is 2^6, and so since 4 is 2^2 it cancels the 6th power to just the 4th power ie. 16, and then 100*16 gives us 1600. Now, let’s ask the same question “25 times 64 is” to an LLM and let the model generate the correct answer in a single token. That is a very difficult task (to scale effectively at) for the model to do within a single forward pass.

Similarly, the <think> or <T> token is also suggested in another paper by David and Tomas titled “Thinking Tokens for Language Modeling” where the authors show a decrease in sentence perplexity (lower is better) after the inclusion of Thinking Tokens.

Figure from "Thinking Tokens for Language Modeling" Paper showcasing <T> thinking tokens

Are Thinking Tokens Similar to Chain Of Thought Reasoning?

Techniques like chain-of-thought (COT) prompting definitely help, but that relies on the capabilities of the foundation LLM to:

Reliably follow the particular reasoning in-context- Short v/s Long context window LLMs (eg. GPT4o’s 128k vs Gemini’s 2M token context window) will perform very differently on maintaining the original reasoning context over long horizon tasks.
Have enough reasoning ability to arrive at the correct answer if it follows the reasoning path- eg. "Small” LLMs (like Gemma 2b, Llama-70b, Mistral-8x7b) v/s “Large” LLMs (like OpenAI GPT-4 or Anthropic Claude Opus) may have very different long-term reasoning capabilities.

Adding <pause> tokens during the pre-training or RLHF removes the reliance on LLM capabilities and attempts to enhance these qualities natively within the model inference.

A paper from New York University by Jacob et al. titled “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models” states the following:

"transformers can use meaningless filler tokens (e.g., ‘......’) in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens"

Application of Thinking Tokens- Reasoning Models

Reasoning models are good at reasoning tasks like coding, maths, and STEM (relevant benchmarks like GPQA, MATH-500, etc) but bad at creative tasks content writing, editing, etc. Many major AI Labs have included some version of <thinking tokens> to create reasoning LLMs. Some of them are:

OpenAI: o1-preview, o1-mini, o1, o1-pro, o3-preview, o3-mini
Google: Gemini Flash Thinking Mode
Qwen: QWQ-32b-preview and Qwen 2.5 Reasoning Model [Open Source Model Weights]
DeepSeek AI: R1 Model with o1 level performance on AIME and MATH benchmarks [Open Source Model Weights]

DeepSeek R1 Reasoning Model Performance Benchmarks

5. Microsoft: rStar-Math beats o1-preview on AIME and MATH benchmarks with a 1.5B model

Microsoft rStar-Math Reasoning Model Performance Benchmarks

Since many AI labs (like OpenAI, Google, Anthropics, etc) sadly do not present deep technical papers or release model architectures, much is left for speculation regarding what those SOTA reasoning models are doing behind the scenes.

A pleasant surprise is the recent open-source work by a few AI labs (like Qwen, DeepSeek, Microsoft) including technical papers and often open-sourcing model weights for reasoning models. From the publically available knowledge (eg. papers, blogs, posts on X, etc), the secret sauce seems to boil down to the following:

hidden COT + thinking/reasoning tokens + RL + test-time compute

Deepseek-R1 Paper showcasing <think> and <answer> special tokens use cases in prompt template

The following image (courtesy of a post on X by Teknium from Nous Research) shows the DeepSeek-R1 Reasoning Model running locally on "sglang across 2x infiniband'ed hgxs"- which is a fancy way of saying "multiple expensive GPUs". The output for the input query "what is 1+1?" shows the output text generating <think> token, for the beginning, and a </think> token, for the end, of the reasoning chain of thought.

DeepSeek-R1 model generating <think> tokens during Reasoning COT (from post by Teknium on X)

Since the current article is about tokens, I'll refrain from further discussing the technical deep learning details (especially distillation, policy optimization, etc) of reasoning models. Let me know in the comments if you'd like to read an article dedicated to that.

Reasoning comes at a cost

o1-preview model is 4.5x more expensive than GPT-4o (3x for input tokens and 1.5x for output tokens). Why? Thinking Tokens are expensive!

NVIDIA CEO Jensen talking about test-time scaling laws (CES 2025 Keynote)

How? eg. 30 input tokens -> 300 output tokens (with 2000 reasoning tokens). So the same query takes ~4.5x more cost to compute output from a reasoning model.

Which is fine, you pay for what you get...... until you realise that you do not have access to these raw chain of thought reasoning tokens. To directly quote the OpenAI API pricing page:

**Output tokens include internal reasoning tokens generated by the model that are not visible in API responses.

Glitch Tokens

Are all hidden tokens useful?

Until now we have seen tokens learnt from the training data. We also discussed additional hidden tokens included to enhance model capabilities (eg. structural, thinking, reasoning, etc). Both of these families of tokens clearly enhance the model performance. But there is another set of tokens with unintended and often unpredictable use cases.

We'll touch on the "unintended" part in a bit but first, let's check out what these tokens look like and what are their "unpredictable" behaviours.

The Legend of “ SolidGoldMagikarp”

In Feb 2023, ie. a few months after the release of ChatGPT, some researchers over at LessWrong were trying to analyse patterns in all the tokens generated during the tokenization. They documented their findings of these "glitch tokens" in a series of blogs part-1, part-2, and part-3. I will briefly go through the gist of these blogs, but for more technical details (and even more interesting findings) I highly recommend checking out their blogs.

Quick reminder: tokenizer uses a vocabulary of tokens, which is often generated by analyzing the frequency of tokens (words; aka byte-pair encodings) from the entire dataset. These tokens are later represented by word embeddings: dense vector representations learned by the model, capturing the semantic meaning of the tokens.

They found that many groups (clusters) of tokens in the tokenizer are more similar than others. Upon further analysis, they found that one particular group of tokens were..... "weird tokens" or "forbidden tokens" (quoting the researchers directly).

5 sample groups (clusters) of tokens from the LessWrong Blog

Technical jargon explanation (skippable): They generated optimized prompts to maximize the logit likelihood probability for certain tokens (similar to Google DeepDream for Computer Vision, but for tokens via continuous word embedding space). Then, they used KMeans clustering on token embeddings based on semantically learnt meaning (vector embeddings). Once cluster centroids were measured, the closest valid tokens to each centroid were analyzed and they found one particular cluster containing "weird" tokens.

Few example Glitch Tokens: " SolidGoldMagikarp", " attRot", "cloneembedreportprint", " TheNitromeFan", and many more. Check out the blog for more.

These tokens looked “untokenlike” in their appearance and naturally, the researchers were curious to find out the answers to the following 3 questions:

1) How does the AI model/LLM interpret these tokens during inference?

When GPT-3 was prompted to repeat these glitch tokens, the model's responses varied dramatically. Some glitch token behaviours observed were:

evading to discuss the token by responding with "I can't hear that", "I can't do that", etc
hallucinatory completion where model describes a glitch token as an entirely different valid-looking token
inter-referential hallucinations where model describes a glitch token as a partial/different anomalous token
outright insults when asked to repeat certain strings/tokens
spelling out a different token/word character-by-character
and more! (check out the blogs for more)

GPT-3.5 Response for " SolidGoldMagikarp"

2) Where did these tokens come from?

For a particular token to be included in the tokenizer, it needs to occur many times (high frequency) throughout the training dataset. From doing some investigative research (googling these glitch tokens), the researchers found that a lot of these (eg. " SolidGoldMagikarp", " RandomRedditorWithNo") came from Reddit threads, log files, online gaming platforms, etc.

Okay, so many of these glitch tokens were present on online forums, etc. However, if that were it, then it should have led to natural interpretations of these tokens by the model. And that brings us to the most important questions of all...

3) What explains these anomalous behaviours?

Documents containing glitch tokens were used to create the tokenizer, but were removed from the training data before training the LLM for next token prediction task. As a result, these tokens were present in the tokenizer, and had semantic embeddings generated for each glitch token, but the LLM never saw these glitch embeddings during pretraining phase as its inputs. Hence, these anomalous behaviours originated from "undertrained" or "untrained" tokens.

PS: OpenAI fixed these glitch token behaviour in the consumer ChatGPT (GPT 3.5) within days of the LessWrong blog being published, though the behaviour could be recreated on their OpenAI Platform for a little longer.

PPS: Such glitch tokens are shown to be present in the latest SOTA models as well like OpenAI o1 pro. Pliny the Liberator on X showed that the OpenAI o1 pro reasoning model fails to interpret any text given between strings in the format of "[Juice:]". Similar token vulnerability in OpenAI's latest o3-mini-high model was shown by Pliny on X in Feb 2025.

Glitch Token Behaviour in OpenAI o1 pro reasoning model (from Pliny the Liberator on X)

There was a relevant paper from 2024 by Yuxi Li et al. titled "Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection" that proposes a GlitchHunter clustering algorithm for discovering anomalous tokens. The paper shows that glitch tokens are ever-present in modern LLMs as well.

Bias from Undertrained Tokens

A recent paper from late 2024 by Yang et al. titled "Problematic Tokens: Tokenizer Bias in Large Language Models" looked at GPT-4 and GPT-4o tokenizers, and found some interesting biases from certain kinds of tokens that they termed "problematic tokens".

The tokenizer vocab count for GPT-4o (~200k) is double the count for GPT-4 (~100k). The researchers found that models need to do extra work during training to understand these extra tokens to avoid having tokenizer bias from having undertrained or untrained tokens. In addition to this, they found that long tokens (which are often uncommon words) are underrepresented in the training data, and can negatively affect the model performance.

Another major bias the authors found was the reliance on english-language tokens as the majority of training data for most SOTA models. They found that GPT-4o has more issues processing long Chinese tokens than GPT-4 due to different tokenization techniques for processing long tokens (i.e. breaking longer tokens into smaller tokens).

For eg. the Chinese sentence "微信公众号天天中奖" translates to "WeChat public account wins lottery every day". But when GPT-4o was asked for its meaning the model gave the meaning as "WeChat money laundering". (note: I was unable to recreate this example- which likely indicates that GPT-4o has been fixed for this error)

GPT-4o mistranslating long tokens in Chinese (from Problematic Tokens Paper)

Vulnerabilities- Token Injection Attacks

Vulnerabilities in LLMs could be an entire article itself (let me know in the comments if I should write one). Most famous LLM vulnerabilities come from Jailbreaks and Prompt Injection attacks. A less popular attack vector comes from something called token injection.

Okay, so what is token injection? In a post on X from Aug 2024, the legendary Andrej Karpathy raised concerns about the unintended behaviour of LLMs similar to SQL Injection attacks with special tokens (check image below).

Anomalous Behaviour with Special Token Injection (source: X post from Andrej)

Okay, but what's going on here? The root of these vulnerabilities (both SQL and Token Injections) is that by default the system does not differentiate between the system instruction and the user query. As a result, user input tokens can be interpreted by LLMs as instructions of equal importance to the system prompts, leading to an opening for malicious actors to take advantage of. Some preventive measures for this are input sanitization and prompt handling for special token encoding.

!!! User input strings are untrusted data !!!

In Jun 2024, a paper by Zhou et al. titled "Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection" introduced the concept of "virtual context". During the instruction-following training stage, the LLMs learn to recognize system context and user context. The authors noticed that using special separator tokens like <SEP> can create a "virtual context" extending the user context that the LLM can be tricked into thinking as system context (similar to the idea for embedding ideas in other people's psyche from the movie Inception).

Token Injection using <SEP> token from "Vitrual Context" paper

The base foundation models, that are used to build these fancy modern LLMs, are explicitly trained to complete missing parts of documents as a next-token-prediction task. In virtual-context attacks, this natural behaviour of LLMs is exploited by getting them to interpret having started answering the malicious query with agreement before letting them predict the next tokens.

Many LLM providers include system instruction tokens (for eg. <end_of_text>, <eot>, [/INST], <|im_end|>) for giving instructions to the LLMs from the user. For eg. LLama models famously use the following format: "[INST] instruction goes here [/INST]". Another paper from Oct 2024, by Zheng et al., titled "Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses" proposed including special system tokens like [/INST] to vanilla jailbreak prompts and showed increased effectiveness in jailbreaking "aligned LLMs".

On the Bizarre World of Tokens

These quirks of tokenization aren’t just computational artifacts; they expose the strange, layered structure of how LLMs "understand" language. What seems like a trivial encoding detail can have ripple effects, shaping everything from model biases to unexpected emergent behaviour. It’s not just about how words get broken down—it’s about how meaning itself is reconstructed, sometimes in ways we never designed or predicted.

In that gap between human intuition and algorithmic parsing, a whole new logic of language emerges. So the next time a model surprises you—whether with brilliance or absurdity—it’s worth asking:

was it the thought behind the words, or just the way they were broken apart? is it reasoning, or just rearranging the pieces in a way that happens to make sense?

[originally posted on LinkedIn on 3rd Feb 2025]

Chinmaya's Writings

Discussion about this post