Capturing AI’s attention: The physics behind repetition, hallucinations, and bias

A deep dive into the physical principles behind the AI attention mechanism, revealing the science behind repetition, hallucination, and bias.
Core content:
1. A new physics-based model of the AI attention mechanism
2. Applying spin-bath theory to open challenges in AI
3. An analogy between the Hamiltonian and the attention weight matrix
In Spin-Transformer: Data Sculpting Spin Glass, the author summarized:
“From the perspectives of conceptual similarity, physical interpretation, and optimized parameter scale, a class of physics-inspired spin-transformers is proposed, based on new mean-field equations for the vector spin magnetization:
a differentiable system of vector spins, driven by data, whose collective behavior can be shaped by training. This is a highly adaptive system, in which the landscape of spin interactions is itself dynamically shaped by the inputs.”
In early April, scholars from George Washington University adopted a similar idea and derived, from a spin bath, the physics of basic attention, the core "magic" of large language models [Reference 1].
The paper states that the theory can be used to quantitatively analyze outstanding challenges in the current AI field, including repetition, hallucination, harmful content, and bias introduced by training and fine-tuning.
Spin Bath Theory
In quantum physics, a "spin bath" describes the interaction between an open quantum system and its environment. It usually involves:
A spin system (the "spin") sitting in an environment made up of many other spins, such as the surrounding atomic nuclei or the lattice vibrations.
Coupling or entanglement between the spin and every spin in the environment, which causes the quantum state of the system to become impure and gradually evolve into a mixed state. This can be used to explain information loss, the generation of noise, and the emergence of classical behavior.
Spin-bath theory has the following interesting analogies with the attention mechanism of large language models:
In a typical spin-bath model, the system consists of a single spin and multiple environmental spins, coupled together through some interaction. The total Hamiltonian decomposes as

$$H = H_S + H_B + H_{SB},$$

where $H_S$ is the Hamiltonian of the system spin itself, $H_B$ is the Hamiltonian of the bath, and $H_{SB}$ is the coupling term between the system and the bath (the most critical part).
Take the isotropic Heisenberg coupling as an example:

$$H_{SB} = \sum_k g_k \,\vec{S} \cdot \vec{I}_k,$$

where $\vec{S}$ is the system spin, $\vec{I}_k$ are the bath spins, and $g_k$ are the coupling strengths.
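To make the decomposition concrete, here is a minimal numerical sketch in NumPy: one spin-1/2 system coupled to two bath spins through an isotropic Heisenberg term. The number of bath spins, the local fields, and the couplings are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

# Pauli matrices (spin-1/2 operators up to a factor of 1/2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def op_on_site(op, site, n_sites):
    """Embed a single-spin operator at `site` in an n_sites tensor product."""
    mats = [I2] * n_sites
    mats[site] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

n_bath = 2                  # two bath spins (illustrative)
n = 1 + n_bath              # site 0 = system spin, sites 1..n_bath = bath
g = [0.3, 0.1]              # system-bath couplings (arbitrary values)
h_sys, h_bath = 1.0, 0.2    # local fields (arbitrary values)

# H_S: the system spin in a field along z
H_S = h_sys * op_on_site(sz, 0, n)
# H_B: the bath spins in a weaker field along z
H_B = sum(h_bath * op_on_site(sz, k, n) for k in range(1, n))
# H_SB: isotropic Heisenberg coupling between the system and each bath spin
H_SB = sum(
    g[k - 1] * (op_on_site(sx, 0, n) @ op_on_site(sx, k, n)
              + op_on_site(sy, 0, n) @ op_on_site(sy, k, n)
              + op_on_site(sz, 0, n) @ op_on_site(sz, k, n))
    for k in range(1, n)
)

H = H_S + H_B + H_SB        # total Hamiltonian, matching the decomposition above
print("Hilbert-space dimension:", H.shape[0])    # 2**3 = 8
print("Hermitian:", np.allclose(H, H.conj().T))  # sanity check
```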
The state of the spin system is shaped by its interactions with all of this "environmental information".
In the Transformer, the representation of a token is “contextualized” through attention interactions with other tokens.
Taking single-head attention as an example, the standard attention formula is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,$$

where $Q$ is the query vector of the current token, $K$ contains the key vectors of all tokens in the context, $V$ contains the value vectors, $QK^{\top}$ is the similarity scoring matrix between the token and its context, and the softmax provides normalized weights for aggregating information from $V$. From this we obtain the following analogy:
Attention can be thought of as the interaction between a token and its “context environment”. Each attention head is like learning an “interaction Hamiltonian” that defines how a token “absorbs information” from other tokens.
Different heads of multi-head attention capture interaction relationships along different dimensions, similar to the many-body coupling structure of a spin glass model; the spin glass model is also commonly used to study energy-minimization paths and emergent behavior in complex systems.
This is consistent with the perspective of the scholar Matthias in Spin-Transformer: Data Sculpting Spin Glass:
“Essentially, the softmax attention matrix can be viewed as a parameterized form of the asymmetric coupling matrix of a vector spin model. Multiple attention heads can be implemented by embedding the coupling matrices into a head-block-diagonal coupling tensor. Cross-attention between the encoder and decoder can be achieved by introducing the context vectors into the coupling matrix.
Since attention is identified with the coupling matrix of the spin model, autoregressive modeling can be accomplished by applying an appropriate triangular mask to the coupling matrix, so the causal structure is preserved during the temporal evolution.
Normalization occurs naturally in the formulation. The queries and keys are used to define the spin-spin interactions and the external magnetic fields; the values can then be interpreted as the magnetization at the previous time step, or as the steady-state magnetization in case of convergence.”
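To ground this picture, here is a minimal NumPy sketch of single-head attention: each token's new representation is a softmax-weighted aggregation over its "context environment", with the learned weight matrices playing the role of an effective interaction. The dimensions and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                      # context length and embedding size (illustrative)
X = rng.normal(size=(n_tokens, d))      # token embeddings ("spins")
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # project the spins into the "embedding bath"
scores = Q @ K.T / np.sqrt(d)           # pairwise "interaction energies" between tokens
weights = np.exp(scores)                # Boltzmann-like factors ...
weights /= weights.sum(axis=1, keepdims=True)   # ... normalized row-wise = softmax
context = weights @ V                   # each token absorbs information from the others

print(weights.round(3))                 # row i: how token i couples to each context token
print(context.shape)                    # (5, 8): contextualized representations
```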
The path integral is a core method in quantum mechanics and statistical physics, used to sum over all possible paths that a particle (or field) can take from its initial state to its final state.
The path integral representation of a token can be written schematically as

$$\langle V \rangle \;=\; \int \mathcal{D}[\gamma]\; e^{-S[\gamma]}\; V[\gamma],$$

where $\gamma$ denotes a path from the starting token to the current token (for example, a token's attention-hop trajectory through the sequence), $S[\gamma]$ is the "action" of the path, which controls the path's importance, $V[\gamma]$ is the Value information accumulated along the path (from attention), and $\mathcal{D}[\gamma]$ is the measure on path space, representing the sum over all possible paths (similar to a softmax over attention paths).
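This analogy is speculative, but it can be illustrated with a toy enumeration: treat each layer's attention matrix as transition weights, let a "path" be a sequence of token positions across layers, take $e^{-S[\gamma]}$ to be the product of attention weights along the path, and accumulate the Value at the path's endpoint. Every detail below (two layers, random weights, endpoint accumulation) is an illustrative assumption rather than the author's formulation.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_tokens, d, n_layers = 4, 3, 2
# Per-layer attention matrices (rows sum to 1), standing in for trained layers
A = [rng.dirichlet(np.ones(n_tokens), size=n_tokens) for _ in range(n_layers)]
V = rng.normal(size=(n_tokens, d))           # Value vectors (illustrative)

target = 0                                   # current token position
total, Z = np.zeros(d), 0.0
# Sum over all paths gamma = (j_1, ..., j_L) the current token can attend through
for gamma in product(range(n_tokens), repeat=n_layers):
    w, prev = 1.0, target
    for layer, j in enumerate(gamma):
        w *= A[layer][prev, j]               # exp(-S[gamma]): product of attention weights
        prev = j
    total += w * V[gamma[-1]]                # Value picked up at the path's endpoint
    Z += w
print("path-summed value:", (total / Z).round(3))
```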
Phase transitions were discussed in the article GPT-4 Technical Principle 2: Phase Transition and Emergence. In statistical physics, a "phase transition" refers to a discontinuous or abrupt change in a system's properties at critical values of certain control parameters.
The Transformer has high-dimensional degrees of freedom (billions or even trillions of parameters), attention that couples local and global information, multi-layer nested "dynamic evolution", and so on, so "phase transitions" will naturally occur.
The following are possible phase-transition behaviors in large models, deduced by the author, which require further study:
In a multi-head attention structure, each head may learn independent paths in the early stages of training, but in later stages, or on complex tasks, the heads naturally exhibit "coupled behavior"; see GPT-4 Technical Principle 6: Phase Transitions of Categories and the Formation of Knowledge.
Hallucination may be a "phase transition" of the information path: from the path-integral perspective, the dominant path is lost and low-weight paths dominate the output;
or there may be symmetry breaking in the attention structure, where some heads or layers lose symmetry and spontaneously deviate from the true path;
or, when the real context is not strong enough (or is suppressed), the model falls back on the "language-model prior", i.e., the language patterns that appeared most frequently during training.
Figure 1
(a) The most basic form of the "attention" mechanism, widely used in all generative AI. There is currently no first-principles theoretical explanation of why it works and under what circumstances it fails.
(b) The physics of the "attention" process, rigorously derived from first principles. Each spin corresponds precisely to a token in the embedding space, and its structure reflects the AI's (e.g., an LLM's) prior training. The wavy lines in the figure represent effective two-body interactions that emerge naturally from the formalism.
(c) The context vector is exactly equivalent to a projection of the two-spin Hamiltonian into the "bath", weighted and concentrated in the sub-region of the bath containing the input spins. The theory predicts how bias (e.g., from pre-training or fine-tuning) can perturb the trained LLM's output so that it becomes dominated by inappropriate content ("bad" outputs such as "THEY ARE EVIL" versus "good" outputs).
Figure 2
Next-word prediction with the basic attention mechanism. Top: first iteration; bottom: sixth iteration.
To simplify, use a vocabulary of four words (THEY, ARE, GOOD, EVIL) and embed each of them as a three-dimensional real vector.
The initial prompt is THEY ARE.
All weight matrices are set to the identity matrix so that they do not obscure the core behavior of the attention mechanism. The four token vectors and the normalized context vector are plotted together in a 2D projection plane.
In both iterations, EVIL (i.e., its embedding vector) exhibits the characteristics of an "attractor": it has the largest projection onto the normalized context vector (see the blue dotted line). As the number of iterations increases, the attractor character of EVIL becomes more pronounced, and its alignment with the context vector increases.
The small, simplified vocabulary used in the figure produces simple attractors, which in turn generate simple outputs that are often not very human-like (e.g. "THEY ARE EVIL EVIL EVIL…").
However, the same analytical framework applies to larger vocabularies, where more complex attractors (e.g., long-period cycles) may emerge; these would break up the repetitive patterns and make the output more realistic.
Likewise, when the "GOOD" and "EVIL" vectors each represent a class of "good" or "bad" words, the output "good" or "bad" words become more diverse and therefore look more realistic.
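The qualitative behavior in Figure 2 can be reproduced with a short NumPy sketch: a four-word vocabulary embedded in three dimensions, identity weight matrices, and greedy next-word selection by projecting the normalized context vector onto the vocabulary. The embedding values below are illustrative choices in the spirit of Figures 3 and 4, not the figure's exact numbers.

```python
import numpy as np

# Toy vocabulary embedded in R^3 (values chosen for illustration only)
vocab = {
    "THEY": np.array([0.25, 0.25, 0.10]),
    "ARE":  np.array([0.10, 0.30, 0.20]),
    "GOOD": np.array([0.40, 0.30, 0.10]),
    "EVIL": np.array([0.40, 0.15, 0.40]),
}
words = list(vocab)
E = np.stack([vocab[w] for w in words])      # vocabulary embedding matrix
d = E.shape[1]

def next_token(prompt_vectors):
    X = np.stack(prompt_vectors)             # W_Q = W_K = W_V = identity
    scores = X @ X.T / np.sqrt(d)            # scaled token-token overlaps
    w = np.exp(scores[-1]); w /= w.sum()     # softmax weights for the current (last) token
    N = w @ X                                # context vector
    N /= np.linalg.norm(N)                   # normalized, as in Figure 2
    return words[int(np.argmax(E @ N))]      # token with the largest projection onto N

sequence = ["THEY", "ARE"]
for _ in range(6):
    sequence.append(next_token([vocab[w] for w in sequence]))
print(" ".join(sequence))   # with these embeddings one word soon repeats: an attractor
```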
The math behind it
The input is a prompt, such as "THEY ARE", consisting of $n$ tokens (words). Each possible token in the vocabulary is embedded as a $d$-dimensional vector, regarded as a "spin" $\vec{x}_i$ (a row vector by convention), so the input is the matrix formed by stacking these spins:

$$X = \begin{pmatrix} \vec{x}_1 \\ \vdots \\ \vec{x}_n \end{pmatrix}.$$
Position encoding is not considered here (see the original paper for details on position encoding processing), so the current discussion is about self-attention.
The intermediate step in Figure 1(a) computes the Query, Key, and Value matrices, obtained by projecting the input spins, through the trained weights, into an embedding space biased towards specific outputs.
The final output is the matrix

$$A \;=\; \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,$$

where $QK^{\top} = X W_Q W_K^{\top} X^{\top}$ is an $n \times n$ matrix. This expression is equivalent to the following "two-body Hamiltonian":

$$H^{(2)}(i, j) \;=\; -\,\vec{x}_i\, W_Q W_K^{\top}\, \vec{x}_j^{\,\top}. \tag{1}$$
This is the Hamiltonian of two spins $\vec{x}_i$ and $\vec{x}_j$, whose interaction is mediated by a high-dimensional embedding space (i.e., the "bath"), as shown in Figure 1(b).
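A quick numerical check of this reading: the attention score matrix $QK^{\top}$ and the two-body Hamiltonian of Equation (1) are the same object up to a sign, with the trained product $W_Q W_K^{\top}$ acting as the effective coupling mediated by the embedding "bath". The shapes and random weights below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 6
X = rng.normal(size=(n, d))                      # input spins (one row per token)
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

J = W_Q @ W_K.T                                  # effective coupling shaped by training ("the bath")
H2 = -(X @ J @ X.T)                              # two-body Hamiltonian: H(i, j) = -x_i J x_j^T
scores = (X @ W_Q) @ (X @ W_K).T                 # the attention score matrix Q K^T
print(np.allclose(scores, -H2))                  # True: the same object, read two ways
```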
Given that LLMs can successfully model human content, the two-body form of attention (Equation (1)) suggests that human content itself may rely primarily on pairwise interactions between words, which is exactly what LLMs can capture. This is similar to how many physical N-body systems can often be reduced to a two-body approximation.
However, some phenomena require at least three-body correlations (such as the Laughlin wave function), so it can be speculated that extending the core attention mechanism of Equation (1) to three-body interactions may lead to more powerful AI.
The two-body Hamiltonian is then passed through a softmax operation, whose physical equivalent is a statistical ensemble at temperature $T = \sqrt{d}$, in which the probability of each attention configuration follows the Boltzmann distribution:

$$p(i, j) \;\propto\; e^{-H^{(2)}(i, j)/T}.$$
Applying these weights to the input Values gives the so-called context vector $\vec{N}$. The supplementary material shows that

$$\vec{N} \;=\; \sum_j \frac{e^{-H^{(2)}(n, j)/T}}{\sum_k e^{-H^{(2)}(n, k)/T}}\; \vec{x}^{(V)}_j,$$

which is analogous to the average spin in mean-field theory.
In Spin-Transformer: Data Sculpting Spin Glass, mentioned at the beginning of this article, scholars proposed the spin-transformer on the basis of just such a mean-field equation for the vector spin magnetization.
The greater the overlap between a Query and a Key (i.e., the more similar the input spins look after being modified by their respective "embedding baths"), the greater that token's contribution to $\vec{N}$. Finally, $\vec{N}$ is projected onto the Values and combined with all token vectors to obtain the probability of each token becoming the next word:

$$p(w) \;=\; \operatorname{softmax}_{w}\!\left(\vec{N}\cdot \vec{x}_w\right),$$

where the softmax runs over all tokens $w$ in the vocabulary.
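Putting the last few steps together as a sketch: Boltzmann weights at temperature $\sqrt{d}$, the context vector as a weighted average of Value spins, and next-token probabilities from the projection of the context vector onto the vocabulary. The final softmax readout is one plausible reading of "combined with all token vectors", assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vocab, d = 10, 6
E = rng.normal(size=(n_vocab, d))        # vocabulary embeddings (illustrative)
X = E[[0, 1, 2]]                         # prompt: three tokens drawn from the vocabulary
W_Q = W_K = W_V = np.eye(d)              # identity weights, as in the toy example above

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
T = np.sqrt(d)                           # ensemble "temperature"
H2 = -(Q @ K.T)                          # two-body Hamiltonian
w = np.exp(-H2[-1] / T)                  # Boltzmann factors for the current (last) token
w /= w.sum()
N = w @ V                                # context vector: the "mean-field average spin"

p_next = np.exp(E @ N)                   # project N onto every vocabulary token ...
p_next /= p_next.sum()                   # ... and normalize into next-token probabilities
print(p_next.round(3))
```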
This means that the "physics" of this attention mechanism amounts to computing Boltzmann probabilities for a two-body Hamiltonian in an unconventional spin bath. The interactions between the spins depend on the properties of the "bath", which itself consists of all possible spins of an embedding space shaped by training.
The statistical probability is shifted by the input spins towards certain embedded subspaces, similar to a non-equilibrium system. The input spins are like the results of previous single-spin measurements, so predicting the next token is like predicting the next measurement outcome.
After each prediction, the pair of interacting spins is updated by the previously generated token, which makes the whole process non-Markovian despite being classical and deterministic, and gives it features reminiscent of "quantum collapse".
The author discussed "non-Markovianity" in Transformer's Next Wave?: "If you think about it carefully, language autoregression and non-Markovianity are actually the norm. In fact, time-delay systems are essentially non-Markovian. Attention, or state-space selectivity, is very critical."
A direct consequence of the linear structure of the Hamiltonian is that the output probability is prone to "attractors", in which a certain word appears repeatedly. This phenomenon is particularly severe when the vocabulary is small or the training is highly biased.
The presence of a token (such as EVIL) makes it contribute more to the subsequent average, tilting the context vector further in that direction and thus increasing the probability of generating the same token again. The smaller the vocabulary, the more pronounced this effect. Figure 2 clearly shows this repeated-attractor phenomenon, which is the repetitive-output problem often observed in small models.
Furthermore, the physics framework can indicate when the output becomes "bad", meaning it has no relevance to the prompt (hallucination) or contains harmful content (e.g., anti-Semitism) even when the prompt is benign.
This happens when certain "bad" words (tokens), buried deep in the vocabulary by training, temporarily acquire the largest projection onto the context vector (see Figure 2). Once

$$\vec{N}\cdot \vec{x}_{\text{bad}} \;>\; \vec{N}\cdot \vec{x}_{\text{good}} \quad \text{for every "good" token},$$

that is, once the probability of a bad word exceeds that of all "good" words (words that cause neither hallucination nor harm), it is output.
Figure 3 shows a simple example of the phase-transition boundary between the "good" and "bad" words when the input prompt is "THEY ARE" (see Figure 1(a)). For a general embedding dimension $d$, the boundary is a $(d-1)$-dimensional hyperplane whose normal vector is the context vector $\vec{N}$. In this example, the four words are THEY, ARE, GOOD, EVIL.
Figure 3 can also be viewed as a coarse, simplified version of a large model: all "bad" words are lumped into EVIL and all "good" words into GOOD. It also describes the transient state of a large model at a given moment, in which the spins of a few tokens happen to be concentrated around the current one.
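The boundary in Figure 3 can be sketched numerically: fix the prompt embeddings, sweep the position of the EVIL embedding over a two-dimensional slice, and record on which side of the condition $\vec{N}\cdot\vec{x}_{\mathrm{EVIL}} > \vec{N}\cdot\vec{x}_{\mathrm{GOOD}}$ each point falls. The embeddings and the choice of slice are illustrative assumptions.

```python
import numpy as np

THEY = np.array([0.25, 0.25, 0.1])        # illustrative embeddings (cf. Figure 3)
ARE  = np.array([0.10, 0.30, 0.2])
GOOD = np.array([0.40, 0.30, 0.1])
d = 3

def context_vector(prompt):
    X = np.stack(prompt)                   # identity weight matrices, as before
    w = np.exp(X[-1] @ X.T / np.sqrt(d)); w /= w.sum()
    return w @ X

N = context_vector([THEY, ARE])            # context vector for the prompt "THEY ARE"

# Sweep EVIL's first two embedding coordinates over a grid (third coordinate fixed at 0.4)
xs = np.linspace(0.0, 0.6, 61)
flips = np.zeros((len(xs), len(xs)), dtype=bool)
for i, a in enumerate(xs):
    for j, b in enumerate(xs):
        evil = np.array([a, b, 0.4])
        flips[i, j] = N @ evil > N @ GOOD  # True: the output flips to the "bad" side
print(f"fraction of this slice where EVIL wins: {flips.mean():.2f}")
```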
Figure 3
Phase-diagram example, showing the behavior of a three-dimensional token embedding with a vocabulary of four words: THEY = (0.25, 0.25, 0.1), ARE = (0.1, 0.3, 0.2), GOOD = (0.4, 0.3, 0.1). Again, for simplicity, $W_Q = W_K = W_V = \mathbb{1}$. As long as the token "bad" (EVIL) remains in the blue region on the left, the output semantics remain "good" (GOOD); but if EVIL lies in the red region, the output suddenly flips to "bad" (EVIL).
The effect of bias on these output boundaries can also be computed, revealing why and when new training or fine-tuning can render an otherwise credible LLM untrustworthy: the bias rotates the output boundaries.
Figure 4 shows simple examples in which increasing the bias induces new (repeated) tokens in the output (e.g., EVIL) while suppressing other tokens (e.g., GOOD).
Even at the scale of a single attention layer, such bias can lead to outputs dominated by harmful content, which may explain why harmful content still appears in all large LLMs despite various protection mechanisms.
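A toy illustration of how bias shifts this boundary: add a small linear bias vector to the context vector, standing in for the cumulative effect of biased fine-tuning, and watch the GOOD/EVIL decision flip as the bias grows. The bias direction, magnitudes, and embeddings are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

GOOD = np.array([0.40, 0.30, 0.10])
EVIL = np.array([0.30, 0.15, 0.40])        # illustrative "bad" token embedding
N    = np.array([0.49, 0.77, 0.42])        # normalized context vector for "THEY ARE" (toy value)

direction = EVIL - GOOD
direction /= np.linalg.norm(direction)      # illustrative bias direction from fine-tuning
for xi in (0.0, 0.05, 0.10, 0.15, 0.20):
    Nb = N + xi * direction                 # biased context vector
    winner = "EVIL" if Nb @ EVIL > Nb @ GOOD else "GOOD"
    print(f"bias xi = {xi:.2f}: output = {winner}")   # flips from GOOD to EVIL as xi grows
```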
Figure 4
(a) Phase boundaries for different linear bias values ξ = 0, 0.025, 0.05 (see the appendix of the original paper for the definition; cf. Figure 3). A change in the phase boundary can cause a drastic change in the output content, because the red-marked token (EVIL) is now a highly likely (and recurring) output, while the blue token (GOOD) becomes extremely unlikely. (b) Change of the phase boundary when positional encoding $P_i$ is added, with $(P_i)_{2m+1} = \sin\!\big(i/10000^{2m/d}\big)$ and $(P_i)_{2m+2} = \cos\!\big(i/10000^{2m/d}\big)$, positional-encoding weight y = 0.1, showing the first 100 iterations of token generation. Here EVIL = (0.4, 0.15, 0.4). As the number of iterations increases, the phase boundary generally rotates counterclockwise around the attractor (GOOD) until it passes through the token EVIL, at which point EVIL becomes the new attractor and subsequent rotations are around EVIL. Therefore, the generated token is GOOD before the attractor transition and EVIL after it. The token embeddings in both panels are the same as in Figure 3; for simplicity, x = 0.4.