Manan Tomar

The Power of Discreteness (in Language Models)

August 12, 2025

Digital systems work inherently over discrete operations on ones and zeros, while analog systems work over continuous quantities. In practice, digital systems end up far more accurate than their analog counterparts. What is it about discrete operations that leads to accurate predictions? The answer lies in the mechanism by which discreteness tackles error propagation. A system which produces continuous predictions will incur at least some error, however negligible, at the first step of prediction for any real-world process; only in theory is the error exactly zero. Over time, this small error can accumulate because the system has no way to keep its propagation in check. A discrete system, on the other hand, has to bin its predictions after every step, thereby offsetting whatever error it accumulated over the previous step.
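As a toy numerical sketch (mine, not from the post), consider a predictor with a small per-step bias tracking a system that advances exactly one unit per step. The continuous rollout drifts without bound, while the rollout that bins to the nearest integer after every step stays exact, as long as the per-step error stays below half a bin.

```python
def rollout(n_steps=100, per_step_bias=0.05, snap_to_grid=False):
    """True system advances exactly 1.0 per step; the predictor is slightly biased."""
    x, final_error = 0.0, 0.0
    for t in range(n_steps):
        x = x + 1.0 + per_step_bias      # imperfect one-step prediction
        if snap_to_grid:
            x = round(x)                 # discretize: bin to the nearest integer
        final_error = abs(x - (t + 1))   # distance from the true state
    return final_error

print(rollout(snap_to_grid=False))  # ~5.0: the small bias accumulates over 100 steps
print(rollout(snap_to_grid=True))   # 0: binning offsets the bias at every step
```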

Discretization is a criterion enforced by humans, with drawbacks as well as benefits. A discrete system typically runs slower, propagates less information, and is inherently mismatched to the continuous systems it models. In those terms, it would always do worse than a perfect continuous system, because of the approximations it has to make at every timestep. And so it appears we are losing right at the very first design choice: the choice of discreteness. In practice, however, we do not necessarily care about the maximum possible accuracy. What we care about more is correctness. If we can find a discrete system that approximates the continuous data we see to a reasonable degree, then that system yields much better predictions than a continuous one. The drawbacks simply do not matter compared to the overwhelming advantage of correctness. Systems which correctly predict processes are radically more valuable than systems which do not, even if they are slower and require more power.

So where does this intuition fall apart (or does it) when considering AI models, specifically language models? Every human language that we have discovered chunks sounds into discrete word-like concepts. Without this capability, sophisticated communication is apparently impossible¹. Within seconds of speaking with someone, we would lose track of what they are saying, since errors would accumulate between the sounds we hear and our interpretation of what they could mean. But this does not happen in real life, because in real life we always have a list of words to map each sound to, and can hence correctly infer it before moving on to the next set of sounds. This process happens at such a fast and unconscious pace that it is hard for us to recognize its existence. The mapping is onto the set of words, which is discrete, and hence error accumulation does not occur.

A standard transformer model produces a discrete output for each token it processes. Pink squares denote where the discretization is enforced in the model.

Because language models model sequences created from a discrete set of words, predicting the next word from the current sequence of words, they are inherently able to prevent error accumulation to a great degree. Of course, I say this assuming that errors are not made in the sense of ‘putting a prediction in the wrong bin’, which could still lead to error propagation. But are they actually doing discrete computation all the way? Our initial intuition would be to say yes, since all computation eventually breaks down into binary manipulation, which is discrete. Should we not care about discretization within the design of the model at all then, given that the eventual operations are discrete? Relatedly, is the supposed point at which discrete prediction seems to be happening, the prediction of the next word, the right point for the kind of capability we want out of language models? Maybe we want the model to be designed in such a way that it parses entire sentences and then predicts a few tokens corresponding to the end of the final sentence. That seems like an arbitrary design choice. Why is it arbitrary? Simply because for most pieces of text, the end of the final sentence does not carry any more special meaning than the rest of the sentences. The overall meaning of a piece of text is usually spread across the whole of it, not located in any particular word.

When we move between the low-level discrete manipulations that happen as a language model is trained and the high-level discrete predictions it makes at every token position, we are switching abstraction levels. Design choices such as discretization depend heavily on the abstraction level at which they are enforced, leading to different behaviors at different abstraction levels. Put another way, if we were to tokenize entire sentences rather than sub-words, a plausibly different behavior would emerge even though the discretization still remains at every *token* position.

Error Propagation in Continuous vs Discrete Models

The propagation of error can be understood as a combination of two different phenomena: 1) how the input is modeled and 2) how the output is modeled. In a next-token prediction regime, phenomenon 1 dictates how big the one-step error is when choosing to model the input as discrete vs continuous. Phenomenon 2, on the other hand, dictates how this error is fed forward to the next computation step, i.e. whether it is offset at every step or passed along as is. We can think of a simple experiment to analyze the effect of these two phenomena. Notice what happens when the input is modeled continuously: a given permutation of input elements/tokens is never exactly seen again in the training dataset (or is seen very rarely), making it harder for the model to understand the context within those permutations. We can simulate this setting within a discrete-input transformer model by keeping the contexts as is but randomly swapping the discrete tokens in the context with surrogate or duplicate tokens (and also the target tokens they would predict); a minimal code sketch of this setup follows Figure A. Here's what we observe:

- The validation loss worsens monotonically as the number of surrogates in the input is increased (Figure A, left). This is expected because it is now harder for the model to understand the exact context: there are increased fluctuations in the context because of the random surrogate swappings. Training on a number of tokens proportional to the number of surrogates does not seem to resolve this, hinting that this is not just a sample-efficiency issue. For the model to see all possible permutations it now has to see exponentially many samples, so it makes sense that linearly increasing the number of tokens seen will not help much.
- Notice what is happening here. We are only plotting the one-step loss, which worsens with more surrogates. If the one-step error is increased, the model would see even more "unseen" contexts as the one-step predictions are fed back into the model to produce long-horizon predictions (autoregressive inference). It follows that the rollouts would be much more error-prone in this case.


Figure A: Loss curves for a 50M parameter nano-gpt model trained on OpenWebText over ~0.6B tokens. X-axis shows training iterations. The model uses rotary embeddings in place of positional embeddings. w/ offsetting refers to combining the probabilities of all surrogates corresponding to the correct token — similar to how error is offset when a continuous prediction is binned when modeling it discretely.
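For concreteness, here is a minimal sketch of one way the surrogate swapping could be simulated. The function name, the id scheme (surrogate k of base token t gets id `k * base_vocab + t`), and the uniform sampling are my assumptions, not the exact setup used for Figure A.

```python
import torch

def swap_with_surrogates(tokens, base_vocab, n_copies, generator=None):
    """Replace each base-token id t with one of its n_copies surrogate ids
    k * base_vocab + t, with k drawn uniformly at random (k = 0 keeps the
    original id, so n_copies = 1 recovers the unmodified data). Applied to
    both context and target tokens, this grows the effective vocabulary to
    base_vocab * n_copies."""
    k = torch.randint(0, n_copies, tokens.shape, generator=generator)
    return k * base_vocab + tokens
```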

Within this discrete model that is simulating a continuous one, each surrogate is treated as an unrelated class: the loss penalizes a prediction that is a surrogate of the correct token just as harshly as a prediction of an altogether different token. We can simulate how discrete offsets would work in this case by grouping together predictions that are surrogates of the same token. In other words, as long as the model outputs tokens that belong to the same group of surrogates as the correct token, it is not penalized in the loss. It is as if an oracle were telling us which surrogates belonged to which original token in the text (a sketch of this grouping follows the results below).

- By combining predictions that lie in the same original-token group, we can reduce the error at that step (Figure A, right). As long as the model predicts one of the surrogates corresponding to the right token, it is not penalized in its loss at inference. This is similar to how a small enough error in the continuous case is offset when the prediction is discretized into a fixed number of bins. Note that a perfect model would assign the same probability to every surrogate of the true token (something a smart tokenizer could, for example, do for us a priori), and so such aggregation of probabilities at inference would lead to the same loss as when the model was trained without surrogates!
- A careful reader might be curious whether the worsening of losses we see is due to 1) randomly adding surrogates in the context or 2) randomly adding surrogates in the targets. In fact, both modifications affect the loss, with surrogates in the targets causing the larger worsening effect.
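Here is a minimal sketch of the grouping at inference, under the same hypothetical id scheme as above (surrogate k of base token t has id `k * base_vocab + t`): the probabilities of all surrogates of a base token are summed before computing the loss against the base-token target. This mirrors the idea behind the "w/ offsetting" curve conceptually, not its exact implementation.

```python
import torch
import torch.nn.functional as F

def offset_eval_loss(logits, base_targets, base_vocab, n_copies):
    """logits: (batch, seq, base_vocab * n_copies); base_targets: (batch, seq)
    ids in [0, base_vocab). Summing the probabilities within each surrogate
    group mimics how binning offsets a small error in a continuous prediction."""
    log_probs = F.log_softmax(logits, dim=-1)
    # id k * base_vocab + t lands at position [k][t] after this reshape
    grouped = log_probs.view(*log_probs.shape[:-1], n_copies, base_vocab)
    collapsed = torch.logsumexp(grouped, dim=-2)   # log of the summed probabilities
    return F.nll_loss(collapsed.flatten(0, 1), base_targets.flatten())
```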

In essence, the point of this little experiment is to show how a continuous model would behave while still effectively training a discrete language model. So far we have established that discreteness is indeed important. Now let's convince ourselves that the abstraction level at which it is enforced is equally important.

Discretize but Where?

Arguably, we should place the point where discretization is enforced at a location where the overall meaning of the piece of text is present, so as to ensure faithful propagation of the right kind of information (the overall meaning). But where is that location? It's not really at every word, is it? It really isn't clear what the answer is here. When the answer isn't clear, wouldn't the obvious, "safe" choice be to place it at the most atomic level, i.e., after every token, so as to correctly propagate every token's information? Such a design can indeed be a good choice, and the evidence is plain in how well current next-token predictor models of language perform.

> But enforcing discretization at every step might not be the best idea. The reason becomes clear when we think about what happens when a new token or word is predicted by the model and then fed back into the model as a new input. The new input and the old input differ only by this word. Presumably, the new computations that can occur within the model are then bounded by how different this new word can be. Because the new word lies in a small space (say some fifty thousand words for the English language), the expressivity of these computations is low compared to what a continuous representation could offer. As a simple example, the continuous representation could be some statistic of a distribution over multiple words, such as the mean, which would allow the model increased flexibility to produce a certain future output.
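To illustrate the contrast (my sketch, not a mechanism from the post): committing to a word before feeding it back restricts the next input to one of |V| embedding vectors, whereas a continuous alternative such as the probability-weighted mean embedding can be any point in their convex hull.

```python
import torch

def next_input_embedding(logits, embedding, discrete=True):
    """logits: (batch, vocab); embedding: torch.nn.Embedding over the vocab.
    Returns the vector fed back into the model at the next position."""
    probs = torch.softmax(logits, dim=-1)
    if discrete:
        tok = probs.argmax(dim=-1)        # commit to a single word
        return embedding(tok)             # one of |V| possible vectors
    # continuous alternative: the expected embedding under the predicted
    # distribution -- a statistic (the mean) over many words at once
    return probs @ embedding.weight       # any point in the convex hull
```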

Figure B (left): Discretization enforced at outputs of all layers may lead to much worse performance since the model is encouraged to collapse information into a few possibilities at very frequent steps. (right): The model chooses when to enforce discretization and when to continue producing continuous outputs (denoted by blue squares). Wherever discretization is not enforced, the corresponding word is not predicted / not considered in the loss.

Think of it another way: if we were to enforce discreteness at the output of each of the transformer blocks that form the language model, i.e. after each layer, we would be limiting the model's computations quite a bit, and in turn worsening its capability to predict the next token well. Note that this is an even stricter case than the one we discussed above, where discretization was along the sequence dimension rather than the model dimension. In the former case, the model's output grows in length, so there is room for more expressivity than in the latter, where the model's output remains the same size. Now this begs the question: why should placing the discretization point at the end of each token prediction be the optimal choice? The point of this post is to say that it really isn't. We must be more careful in choosing the precise location.

Let's think of alternatives to the conventional design described above. One way is to allow the model to choose the location on its own. By choosing the location itself, the model decides when to predict a word in the text and when to simply predict a continuous representation. At the places where it chooses a continuous representation, the model would simply pass that representation on to the next location for processing, at which point it again chooses between predicting the actual word and continuing with a continuous representation. If we want that property, a natural related question is whether next-token prediction should be abandoned. Should the model instead predict whichever tokens it chooses on its own (subject perhaps to some criterion, say, that it predicts at least 80% of the tokens seen in the text)? At all intermediate locations where it is not forced to discretize its predictions, it would simply keep producing a continuous representation. Or should the model predict all tokens, but only at the points where it chooses to output an actual word? Say the model keeps its predictions in continuous space for the first five time steps; it could then output all five ground-truth tokens encountered so far at the sixth location.
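To make the idea concrete, here is a hypothetical sketch of one way such a choice could be wired up. The module, its names, and the soft mixing (a stand-in for a hard, e.g. straight-through, decision) are my assumptions, not a design proposed in the post. Positions where the gate commits would contribute a token prediction to the loss; the rest would pass their continuous state forward.

```python
import torch
import torch.nn as nn

class CommitOrContinue(nn.Module):
    """Hypothetical: at each position, decide whether to commit to a discrete
    word (and re-embed it) or to carry the continuous hidden state forward."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)           # probability of discretizing here
        self.head = nn.Linear(d_model, vocab_size)  # token prediction head
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, h):                           # h: (batch, d_model) hidden state
        p_commit = torch.sigmoid(self.gate(h))      # learned "discretize now?" signal
        logits = self.head(h)
        committed = self.embed(logits.argmax(dim=-1))   # discrete path: a real word
        # soft mix as a placeholder for a hard choice: where the gate is high,
        # feed the word back in; elsewhere, keep the continuous representation
        next_input = p_commit * committed + (1 - p_commit) * h
        return next_input, logits, p_commit
```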

Flip Side of the Same Coin - Smarter Tokenization

An upside-down way of viewing the discussion so far is to put the emphasis on tokenization itself, the first step involved in modeling language. Tokenization inherently defines where the discretization boundary lies. A natural boundary in language is after every word or after every common prefix, which is largely how standard tokenizers end up defining boundaries. However, if the tokenizer could collapse entire sentences of arbitrary length into single tokens, then we are suddenly predicting in a much more expressive way. In this case, even though discretization is enforced at each time step, what each time step *means* is vastly different from when it consisted only of common prefixes or single words. Furthermore, if such boundaries can be learnt by the model itself, we get into a regime very similar to the one where the model decides where discretization is enforced (and, in turn, which token is predicted) and where continuous outputs are allowed.
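As a toy illustration of the extreme end of this idea (mine, not the post's), here is a sketch that treats every distinct sentence as a single token. It also makes the drawbacks obvious, a vocabulary that explodes and sentences never seen in training, which is why learned, softer boundaries are the more interesting direction.

```python
import re

def sentence_level_tokenize(corpus):
    """Toy: every distinct sentence becomes one token id. A real learned
    tokenizer would discover boundaries rather than split on punctuation,
    and would need a way to handle sentences never seen in training."""
    vocab, ids = {}, []
    for text in corpus:
        for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sent:
                ids.append(vocab.setdefault(sent, len(vocab)))
    return ids, vocab

ids, vocab = sentence_level_tokenize(["The cat sat. The dog ran. The cat sat."])
print(ids)  # [0, 1, 0]: a repeated sentence maps back to the same single token
```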

The higher-level point of the above thought exercise is to show that discretization is an important property, but choosing to enforce it at every token is an arbitrary choice. This is because choosing where discretization happens defines the abstraction level, which in turn leads to different emergent behaviors. The meaning of a piece of text lies all over the text and not at any particular token. Perhaps models of the future will do away with such a choice and flow arbitrarily between the more expressive continuous representations and the more error-reducing discrete representations. In other words, we would be doing away with the initial criterion of predicting everything as is, each token at each location it is seen, in favor of predicting *almost* everything in *almost* the same fashion it is seen. This kind of fuzziness can be useful in relaxing the constraint of getting every token's prediction right. We almost never really care about predicting each token as is. What we care about is whether the model has gained a good understanding of the structure in the tokens. Arguably, predicting each token at its originally seen location acts as a very good proxy for embedding such an understanding within the model. However, that is not to say that a better or more powerful method does not exist for the same purpose.