The Parallelism Tradeoff: Limitations of Log-Precision Transformers
William Merrill
Center for Data Science, New York University, New York, NY, USA
willm@nyu.edu
Ashish Sabharwal
Allen Institute for AI
Seattle, WA, USA
ashishs@allenai.org
Abstract
Despite their omnipresence in modern NLP,
characterizing the computational power of
transformer neural nets remains an interest-
ing open question. We prove that transformers
whose arithmetic precision is logarithmic in
the number of input tokens (and whose feedfor-
ward nets are computable using space linear in
their input) can be simulated by constant-depth
logspace-uniform threshold circuits. This pro-
vides insight on the power of transformers
using known results in complexity theory. For
example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space),
then transformers cannot even accurately solve
linear equalities or check membership in an
arbitrary context-free grammar with empty
productions. Our result
intuitively emerges
from the transformer architecture’s high par-
allelizability. We thus speculatively introduce
the idea of a fundamental parallelism trade-
off: any model architecture as parallelizable as
the transformer will obey limitations similar
to it. Since parallelism is key to training mod-
els at massive scale, this suggests a potential
inherent weakness of the scaling paradigm.
1 Introduction

This work aims to characterize the computational model implicit in transformer neural networks (Vaswani et al., 2017), which form the basis of recent breakthroughs in large language models such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2020), and GPT-3 (Brown et al., 2020).
2020). What computational primitives can the
transformer’s components implement, and what
problems can the full system solve in aggregate?
These questions are important for interpreting
transformers in a principled way, understanding
potential limitations of their reasoning capabili-
ties, and building trust in deployed transformer-
based systems.
Early theoretical work on transformers es-
tablished their Turing completeness, albeit with
assumptions like infinite precision and arbitrarily
powerful feedforward subnets (Pérez et al., 2019;
Dehghani et al., 2019). On the other hand, a strand
of more recent work uses techniques from cir-
cuit complexity theory to derive strong limitations
on the types of problems transformers can solve
given restrictions on the form of attention allowed
in the transformer. Specifically, Hahn (2020)
and Hao et al. (2022) showed that transform-
ers restricted to hard attention are very limited:
they can only solve problems in a weak com-
plexity class (non-uniform AC0) that doesn’t even
contain basic problems like majority of n bits.
Merrill et al. (2022) extended this to a more general
class of ‘‘saturated attention’’ transformers with a
floating point datatype, and showed a larger class
of problems (non-uniform TC0) as an upper bound.
This motivates analyzing a setting that strikes a
middle ground: Can we characterize transformers
whose precision and feedforward nets’ computa-
tional power are realistically bounded, but where
attention is also realistically expressive?
An important practical limitation of these prior
results is the ‘‘non-uniform’’ nature of the con-
sidered circuit classes, which makes these classes
non-realizable and the findings difficult to in-
terpret. This is because non-uniform AC0 and
TC0, while highly limited in computation, also
contain some problems that are not even decid-
able, i.e., for which there doesn’t exist any exact algorithm. Thus, non-uniform classes cannot be directly compared with standard algorithmic complexity classes such as P, NP, etc. This motivates
our second key question: Can we derive uniform
upper bounds on transformers?
We show that one can achieve both of these
goals by making the modest assumption that all
values in the transformer have O(log n) preci-
sion (where n is the number of input tokens),
and, similarly, that the transformer’s subnetworks are computable in O(log n) space. Log precision is
enough to represent the positional encodings at
the input layer of the transformer, and to encode
pointers to all other positions in the sequence at
later transformer layers. Assuming log precision
across all layers captures the idea that the hid-
den representations contain a constant number of
hidden states whose precision (16 or 32 bits) is small relative to the length of the input (2048 in GPT-3). On long sequences, the precision will not be enough to losslessly encode the full input sequence into a single vector. Instead, the processing
of the sequence must somehow be distributed in
each layer and performed in parallel.
Upper Bound on Transformers. Our main
contribution is proving that log-precision trans-
formers can be simulated by uniform constant-
depth threshold circuits. Thus, such transformers can only solve problems in uniform TC0. This
characterization is strikingly weak compared to the
Turing-completeness of infinite-precision trans-
formers. Since we believe log precision is more
realistic for practical transformers than infinite
precision, these results point to the conclusion
that transformers are not Turing-complete in practice.
In contrast to past results, our upper bound on
transformers is a uniform circuit class, enabling
direct comparison of log-precision transformers to
many natural complexity classes. These connec-
tions reveal specific problems that define the upper
limits of log-precision transformers’ capabilities,
as discussed further in §2.
Intuitively, our upper bound says that log-precision transformers are computationally shallow, and that this shallowness can be understood to emerge from their parallelizability.
Transformers’ inherent parallelism is useful for
training them efficiently at massive scale, but may
limit the complexity of the computations they
can express. We introduce the term parallelism
tradeoff to capture this idea, which represents
a potential fundamental weakness of the current
paradigm of scaling language models. Formally
characterizing reasoning capabilities relevant to
language models and understanding whether they
likely fall outside upper bounds implied by the
tradeoff would clarify the practical implications
of this limitation of scaling.
Instruction Following and Advice Transform-
ers. We also consider an instruction following
setting (Brown et al., 2020) where the transformer
is provided the description of a task along with an
input on which to execute the instruction. We con-
struct a practically parameterizable transformer
that can execute instructions perfectly if they are
provided in the form of TC0 circuits. This com-
plements recent work that studies transformers’
ability to follow other forms of instructions such
as regular expressions (Finlayson et al., 2022).
Based on the fundamental property that trans-
formers can correctly evaluate any given TC0
circuit on a given input, we introduce the notion of
advice transformers akin to advice-taking Turing
machines. We show that transformers can recog-
nize any (non-uniform) TC0 language if provided
appropriate poly-size advice.
In summary, our findings provide new in-
sights on both the abilities and the limitations
of transformers, and bring out bounded precision,
threshold computations, and parallelism as key no-
tions for understanding the implicit computational
model of transformers in practice.
Roadmap. Before diving into technical details,
we discuss in §2 the implications of our results
on both fundamental as well as practical abilities
of transformers. §3 provides a brief primer on cir-
cuits as a model of computation. It then discusses
a way of serializing a circuit into a string; we later
show how to generate such serializations using a
resource-bounded algorithm, which is the key to
proving containment of transformers in uniform
circuit classes. §4 defines our formal model of
bounded-precision transformers. §5 derives our
first formal bound on log-precision transformers.
This bound involves non-uniform circuit families,
similar in spirit to prior results in this area. §6
proves our more technical main result: the first
uniform circuit complexity upper bound for trans-
formers (specifically, uniform TC0). Finally, §7
provides a lower bound on transformers, intro-
duces the notion of an Advice Transformer, and
connects these to the machine learning problems
of Instruction Learning and Following.
It could also be that the limitations of par-
allelism are not a curse but a blessing, if they
constrain the hypothesis space in a way useful for
apprendimento. We have no evidence that this is true,
but mention it as an alternate interpretation of the
results that could be clarified in future work.
2 Implications of Our Findings
Before diving into technical details, we discuss
the general implications of our findings on the
abilities and limitations of transformers. We will
focus here on our main result (Theorem 2), which
shows that log-precision transformers are in the
complexity class logspace-uniform TC0.
The Parallelism Tradeoff. One interpretation
of complexity classes such as NC0, AC0, E
TC0 is sets of poly-time solvable problems that
are parallelizable to a very high degree—they
can be solved in parallel in constant time with
enough parallel processors. This gives some in-
tuitive explanation of our result: log-precision
transformers end up in TC0 because they were
designed to be highly parallelizable. Since par-
allelism is an important property of
today’s
dominant paradigm of training models at mas-
sive scale,
this points to the conclusion that
any massively scaled up model—transformer or
otherwise—will likely obey restrictions similar to
the ones derived here for log-precision transform-
ers. There is thus an important tradeoff between
the massive parallelizability of today’s networks
and their representation power.
What Transformers Can/Cannot Compute.
Our result places log-precision transformers in the complexity class logspace-uniform TC0.
This has immediate implications on the kinds
of problems such transformers can and cannot
accurately solve.
Consider any problem X that is complete for a
complexity class C that contains logspace-uniform
TC0. By definition of completeness, every prob-
lem log-precision transformers can solve perfectly
is efficiently reducible to X and is thus no harder
than X. This implies that—despite their massive
size—the computation performed by such trans-
formers is, for instance, no harder than solving
basic L-complete problems like graph connectiv-
ity: the problem of checking whether there is a
path between two nodes in an undirected graph
(Lewis and Papadimitriou, 1982; Reingold, 2008).
By the same token, if C is strictly larger than
logspace-uniform TC0,
then such transformers
cannot perfectly solve X. Thus, log-precision
transformers cannot perfectly solve the following
reasoning problems:
• Linear equalities: find x s.t. Ax = b¹
• Universal context-free recognition¹,²
• Propositional satisfiability (SAT)³
• Horn-clause satisfiability (HORN-SAT)¹
• AI planning (Bylander, 1991)
• Permanent computation⁴
This highlights the limits of practical transformers
with limited-precision arithmetic, indicating that
they are far from being universal or all-powerful
as suggested by some prior studies.
One important caveat about these negative re-
sults is that they are asymptotic in nature—they
apply for ‘‘large enough’’ input size n. It’s pos-
sible for log-precision transformers to solve such
problems easily when n is small. Further, these
negative results are about exact solutions, Ma
they often also extend beyond this when formal
hardness-of-approximation results are known.
Limitations of Our Formal Model. Prior for-
mal characterizations of transformers either make
unrealistically strong assumptions (Pérez et al.,
2019; Dehghani et al., 2019) or place unrealistic
restrictions (Hahn, 2020; Hao et al., 2022;
Merrill et al., 2022). In contrast, we make only
one assumption—namely, all intermediate values
in the transformer are limited to O(log n) bits,
where n is the number of input tokens. We next
discuss some implications of this assumption and
what our findings mean for practical transformers.
As mentioned above, our bounds are asymptotic
in nature and thus apply when n is sufficiently
large. In practice, transformers use fixed preci-
sion at each computation node, which is more
restrictive than precision growing with the input
sequence length n, as O(log n) bits. However,
this constant could be large and thus, for rel-
atively small n, our results do not rule out
practical transformers solving difficult problems.
Our results, however, do show that as n grows
sufficiently large, log-precision transformers are
fundamentally limited to problems within TC0
and cannot accurately solve various commonly
studied problems mentioned earlier under ‘‘What
Transformers Cannot Compute’’. Extending our
analysis to small n will help close the gap to
practice.
1Assuming logspace-uniform TC0 ≠ P. Follows because these problems are P-complete (Greenlaw et al., 1991).
2Takes both a grammar and a string as input and returns whether the grammar generates the string. Jones and Laaser (1976) demonstrate P-completeness.
3Assuming logspace-uniform TC0 ≠ NP. Follows because SAT is NP-complete (cf. Biere et al., 2009).
4Assuming logspace-uniform TC0 ≠ #P. Follows because the permanent is #P-complete (Valiant, 1979). Allender (1999) shows the permanent is not in logtime-uniform TC0.
Our formal model is based on a binary classifi-
cation view of transformers. However, our results
apply directly to multi-class classification as well
and can be extended to generation problems by
viewing, for instance, next word prediction in NLP
as a multi-class classification problem. However,
if the transformer decoder is allowed to condition
on its previous output in a generation problem,
then this would violate our formal setup.
2.1 Potential Applications
Extracting Circuits from Transformers. Elhage
et al. (2021) propose extracting circuits5 that
capture the computational structure of transform-
ers. Our results suggest threshold circuit families
are a good formalism for expressing mechanisms
extracted from transformers. Constructively con-
verting transformers to threshold circuits is beyond
the scope of the current paper, although we hope
to explore this in more detail in future work.
Testing Separation Candidates in Complexity
Theory. Theorem 2 also motivates a paradigm
for quickly testing complexity theory conjectures.
If a problem is believed to separate TC0 and
NC1, a transformer can be trained on problem
instances. If the transformer generalizes perfectly
to harder instances than it was trained on, Questo
gives an empirical hint that the problem is in TC0,
providing evidence against the conjecture.
3 Circuit Computation
Let {0, 1}∗ be the set of finite binary strings. For
x ∈ {0, 1}∗, let |X| be its length. We refer to a
function from {0, 1}∗ to {0, 1}∗ as a Boolean func-
tion. Boolean functions can implement arithmetic
operations if we define a semantics for binary
strings as numbers. We will treat the intermediate
values in a transformer as binary strings, and the
internal operations as Boolean functions.
Circuits are a model of computation for com-
puting Boolean functions of fixed-length binary
strings.6 Formally, a circuit is a directed acyclic
computation graph. The leaf nodes represent bi-
nary variables and their negations. The internal
nodes represent functions in some set G, and the
5Their sense of ‘‘circuit’’ is not exactly the formal
sense we use in this paper, though the goal of capturing
transformers’ implicit computational mechanism is the same.
6For a mini-tutorial on circuit complexity theory and its
relevance to transformers, see Merrill et al. (2022).
directed edges represent the flow of function out-
puts into inputs of other functions. One or more
nodes in the circuit are marked such that their
value is the output of the circuit.
Definition 1. For a set of functions G, a G-circuit
is a directed acyclic computation graph where the
internal nodes have labels from G.
Complexity Measures. The size of a circuit is
the total number of gates in it, including negation.
The depth of a circuit is the length of the longest
path from any input node to any output node.
Circuit Families. A circuit family generalizes
a circuit to take variable-length binary strings as
input. Formalmente, a circuit family is a sequence
of circuits Cn : {0, 1}n → {0, 1} for n ∈ N.
A circuit family implicitly recognizes a formal
language defined as follows:
Definition 2. A circuit family Cn recognizes L ⊆
{0, 1}∗ if, for all x ∈ {0, 1}∗, C|X|(X) = 1 if and
only if x ∈ L.
We now define classes of languages by con-
straining the complexity of the circuit families
needed to recognize them:
Definition 3. Let non-uniform AC0 be the set of
L ⊆ {0, 1}∗ such that L is recognizable by a poly-
size, constant-depth {¬, ∧, ∨}-circuit family.
For k ∈ N, a threshold gate θ≤k takes m input bits and returns whether Σ_{i=1}^{m} x_i ≤ k. We define θ≥k analogously. For example, θ≤3(110011) = 0.
Definition 4. Let TC0 be the set of L ⊆ {0, 1}∗
such that L is recognizable by a poly-size,
constant-depth {θ≤k, θ≥k}k∈N-circuit family.
The gates ¬, ∧, and ∨ are all just special cases
of thresholds, so we can imagine TC0 circuits to
have access to these as well. Thus, TC0 circuits
can implement AC0 circuits.
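For concreteness, the following minimal sketch (the helper names theta_le and theta_ge are illustrative, not from this paper) shows threshold gates over a list of bits and how ¬, ∧, and ∨ arise as special cases:

```python
# Minimal illustrative sketch: threshold gates over a list of bits, plus
# AND/OR/NOT as the special cases mentioned above.

def theta_le(k, bits):
    # theta_{<=k}: returns 1 iff at most k of the input bits are 1
    return int(sum(bits) <= k)

def theta_ge(k, bits):
    # theta_{>=k}: returns 1 iff at least k of the input bits are 1
    return int(sum(bits) >= k)

# Example from the text: theta_{<=3}(110011) = 0, since four of the bits are 1.
assert theta_le(3, [1, 1, 0, 0, 1, 1]) == 0

AND = lambda bits: theta_ge(len(bits), bits)  # all m inputs must be 1
OR  = lambda bits: theta_ge(1, bits)          # at least one input is 1
NOT = lambda bit:  theta_le(0, [bit])         # 1 iff the single input is 0
```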
Circuit Serialization. We identify a circuit with
its serialization in a formal language that identifies
each node’s label and adjacency list. We will
adopt a specific grammar for concreteness, Ma
our construction can be adapted to other string
representations of circuits.
We define a circuit serialization as a traversal
of a circuit ordered by some topological sort.
In this serialization, leaf nodes (variables) are
represented by the string X. An internal node (gate)
is represented in Polish notation by the function it
computes (AND, OR, or NOT) followed by a list of
pointers to its arguments. Each argument &1^j of gate i encodes (in unary) a zero-indexed pointer
to the j-th gate in the circuit, where j < i. The
final node is interpreted as the circuit output.
To serialize {∧, ∨}-circuits, we use the follow-
ing grammar, where the i parameter is passed
through Gate[i] nonterminals to track the index of
the gate in left-to-right order:
Circuit → Gate[1] Gate[2] · · · Gate[g]
Gate[i] → X | NOT Arg[i] | Op Arg[i]∗
Op → AND | OR
Arg[i] → &1^j   s.t. j < i
In the Arg[i] rule, we enforce that j < i so that
arguments must be pointers to already defined
gates. As an example of this serialization language,
the circuit for x1 ∨ ¬x2 ∨ x3 is represented as7
X X X NOT &1 OR & &111 &11
By convention (cf. §3), negations in AC0 circuits
are usually taken to occur at the beginning of the
circuit, rather than after ∧ or ∨ nodes.8 Our seri-
alization grammar does not enforce this property,
but of course any circuit with this property can be
serialized by our grammar.
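The serialization can be produced mechanically. The sketch below (the helper serialize and its gate-list input format are illustrative choices, not part of the paper) reproduces the example string above:

```python
# Illustrative sketch of the AND/OR serialization of §3: leaves are "X", gates
# are written in Polish notation, and each argument "&" + "1"*j is a unary,
# zero-indexed pointer to gate j.

def serialize(circuit):
    """circuit: gates in topological order; each gate is ("X",) for a variable
    or (op, [indices of earlier gates]) with op in {"NOT", "AND", "OR"}."""
    tokens = []
    for gate in circuit:
        if gate[0] == "X":
            tokens.append("X")
        else:
            op, args = gate
            tokens.append(op)
            tokens.extend("&" + "1" * j for j in args)  # pointers with j < i
    return " ".join(tokens)

# The circuit for x1 ∨ ¬x2 ∨ x3 from the text:
circuit = [("X",), ("X",), ("X",), ("NOT", [1]), ("OR", [0, 3, 2])]
print(serialize(circuit))  # X X X NOT &1 OR & &111 &11
```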
It is a bit more complicated to serialize threshold
circuits. Formally, a threshold circuit serialization
is generated by the following grammar:
Circuit → Gate[1] Gate[2] · · · Gate[g]
Gate[i] → X | Dir 1^k 0^{m−k} Arg[i]^m
Dir → <= | >=
Arg[i] → &1^j   s.t. j < i
In the rewrite rule for Gate[io], m ∈ N is the arity
of the gate, and k ≤ m is its threshold. The span 1^k
after Dir can be interpreted semantically as a unary
encoding of the parameter k for a threshold gate,
padded by 0’s to the number of total arguments
of gate i. For simplicity, we imagine ¬ gates are
represented as unary θ≤0 gates. Thus, the circuit
θ≥1(x1, ¬x2) would be represented as
X X <= 00 &1 >= 10 & &11
7Spaces here (and in the grammar) are added for read-
ability. We will ignore these spaces when passing circuit
serializations as inputs to a transformer in Section 7.
8We can apply De Morgan’s laws to force any AC0 circuit
to have this property.
Uniformity. The circuit families we have de-
fined above are non-uniform, meaning that we
do not enforce that the circuits processing dif-
ferent input sizes must be related in any way.
In degenerate cases, non-uniform circuit families
can solve undecidable problems9 because they
have infinite description length, making them a
physically unrealizable model of computation.
Complexity theorists have thus introduced uni-
form circuit families. Uniform circuit families are
a realizable model of computation with relations
to classes in computational complexity and formal
language theory.
Intuitively, in a uniform circuit family, the cir-
cuits for different input sizes must be ‘‘somewhat
similar’’ to each other. We formalize this (cf.
Arora and Barak, 2009) by saying that there exists
a resource-constrained Turing machine that maps
the input 1n to a serialization of circuit Cn.
Definition 5. A language L is (S(N), IO(N))-spazio
uniformly computable by a circuit model M iff
there exists a Turing machine that, for all n ≥
0, uses S(N) space to map 1n to an M -circuit
recognizing L on inputs of size I(N).
This notion of uniformity is more general than
the standard notion in that the input size I(N) is a
function of the problem complexity n. The reason
for this is that we will apply uniformity to subcom-
putations with different input sizes I(N) within a
larger computation of input size n. The standard
notion of uniformity corresponds to I(N) = n.
Inoltre, we will refer to a circuit family
as uniform if it is uniformly computable with
S(N) = O(log n) (cf. Arora and Barak, 2009).
We can define uniform versions of AC0 and TC0
by adopting the previous definitions exactly, Ma
also enforcing uniformity. For the rest of the paper
we will clarify whether we mean the uniform or
non-uniform variant of TC0 when unclear from
context, since both classes will come up.
4 Bounded-Precision Transformers
A transformer (Vaswani et al., 2017) is a neu-
ral network architecture made up of a constant
number of transformer layers. A transformer
layer is a module that computes self-attention
9Consider the unary language 1n such that Turing machine
N (under some arbitrary enumeration) halts. This problem is
in non-uniform AC0 since we can hard-code the right answer
for each n in Cn.
over a sequence followed by an elementwise
transformation of the output vectors.
4.1 Precision and Space
We will assume that each transformer is resource
bounded in terms of the precision of each value it
computes and, for some of our results, the space it
uses for the computation of key operations such as
embedding, attention, and activation. Specifically, we will assume precision p, i.e., the values at all layers, as well as the outputs of all key intermediate operations in it (attention, activation, arithmetic operators, etc.), are represented using p bits. This is a realistic assumption as, in practice, today’s transformers are typically limited to the 64-bit precision of the underlying hardware. Formally,
we define p-precision as follows:
Definition 6. A k-ary function f : x_1, . . . , x_k ↦ y is p-precision if x_1, . . . , x_k, y ∈ {0, 1}∗ have size at most p bits, and f can be computed by a p-space-bounded Turing machine.
This says that the sizes of the function’s input and output are bounded by p. Similarly, the intermediate space used by the computation must also be bounded by p. Thus, higher precision computations cannot somehow be hidden inside f.
Definition 6 naturally applies to functions with
bounded arity k. We will also need to define
p-precision for the summation operator in the
transformer, which adds n different floats of size
p.10 Adding n floats can blow up the precision
needed to represent their sum. For example, imagine adding the floating points 1 · 2^0 + 1 · 2^c. We obtain (2^c + 1) · 2^0, whose mantissa takes c + 1 bits to represent. In practice, computers do not preserve full precision in such situations: instead, small terms like 1 · 2^0 are discarded. Thus, we
define the transformer’s addition operation ⊕
to be similarly approximate (and thus preserve
precision); see §A.
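The precise definition of ⊕ is given in §A; the toy sketch below (the helper approx_add and its exact rounding rule are illustrative, not the appendix's construction) only conveys the effect described above:

```python
import math

# Illustrative sketch only: re-round a sum to a p-bit mantissa, so that small
# addends such as 1*2^0 are lost when added to a much larger term like 1*2^c.

def approx_add(x, y, p):
    """Add two floats, then keep only the top p bits of the result's mantissa."""
    s = x + y
    if s == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(s)))   # exponent of the sum
    unit = 2.0 ** (e - (p - 1))         # value of the lowest mantissa bit kept
    return round(s / unit) * unit       # drop everything below that bit

# With an 8-bit mantissa, 2^20 + 1 rounds back to 2^20: the small term is lost.
print(approx_add(2.0 ** 20, 1.0, p=8))  # 1048576.0
```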
4.2 Transformer Definition
4.3 Attention Heads
The core building block of a transformer is an
attention head. We define this at a high level of
abstraction as follows:
10Our proof also goes through if the transformer weights
are integers, as is sometimes done (Dettmers et al., 2022).
Definition 7. A p-precision attention head is spec-
ified by a binary p-precision similarity function
S : {0, 1}p × {0, 1}p → {0, 1}P.
Let h1, . . . , hn ∈ {0, 1}p be the input sequence
to a p-precision attention head, and let ⊕ be
approximate floating-point addition (§A).
Definition 8. For all (cid:3) ≥ 0, a p-precision attention
∈ {0, 1}p via
head H (cid:2)+1
computes a vector a(cid:2)+1
ih
H
UN(cid:2)+1
ih =
N(cid:3)
j=1
S(H(cid:2)
io, H(cid:2)
j)
Zi
· h(cid:2)
j,
(cid:4)
j).
N
j=1 s(H(cid:2)
where Zi =
io, H(cid:2)
Standard transformer attention heads (Vaswani
et al., 2017) are a special case of this definition
where s is scaled dot-product similarity between
keys and queries. Standard transformers also have a linear or affine value function applied to each h^ℓ_j in the sum over j. By its affineness, the value function can, without loss of generality, be removed from the attention head and considered to be part of the transformer layer (i.e., applied to the output of the attention head).
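For concreteness, here is a minimal sketch of Definition 8 (illustrative code, not the paper's; the value function is omitted since, as noted, it can be folded into the layer, and the similarity s is left as a parameter):

```python
import math

# Illustrative sketch of Definition 8: an attention head is specified by a
# similarity function s; the output at position i is the s-weighted average
# of the h_j, normalized by Z_i.

def attention_head(h, s):
    """h: list of n vectors; s: binary similarity function. Returns a_i for all i."""
    n, d = len(h), len(h[0])
    out = []
    for i in range(n):
        scores = [s(h[i], h[j]) for j in range(n)]
        Z_i = sum(scores)                 # in the paper, the approximate sum ⊕
        out.append([sum(scores[j] / Z_i * h[j][c] for j in range(n))
                    for c in range(d)])
    return out

def scaled_dot_product_similarity(q, k):
    # the special case used by standard transformers: exp(q·k / sqrt(d))
    d = len(q)
    return math.exp(sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d))
```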
4.4 Transformer Layers
A p-precision transformer layer is then a tuple of
heads and a function f used to combine them.
Definition 9 (p-precision transformer layer). UN
p-precision transformer layer is a tuple L^{ℓ+1} = ⟨H_1, · · · , H_k, f⟩, where each H_h is an attention head and f : ({0, 1}^p)^k × {0, 1}^p → {0, 1}^p is a
p-precision activation function.
1, . . . , H(cid:2)
A p-precision transformer layer can be un-
derstood to define a sequence of vectors
H(cid:2)+1
, . . . , H(cid:2)+1
in terms of an input sequence
N
1
of vectors h(cid:2)
N (coming from the previ-
ous layer in the transformer) by first computing
k attention heads in parallel and then combin-
ing their output using f . The first k inputs to f
will correspond to the attention head outputs, E
the additional input is the original input from the
previous layer. Recall that a(cid:2)+1
is the output of
ih
head H (cid:2)+1
ih on input h(cid:2) at position i. The function
computed by a transformer layer can be described
formally as follows.
Definition 10 (Transformer layer computation). For ℓ ≥ 0, a p-precision transformer layer L^{ℓ+1} recurrently computes the output sequence h^{ℓ+1}_1, . . . , h^{ℓ+1}_n as a function of the inputs
N, Dove, for 1 ≤ i ≤ n,
H(cid:2)
1, . . . , H(cid:2)
component is computed according to
ik , H(cid:2)
i = f (UN(cid:2)+1
H(cid:2)+1
io).
i1 , . . . , UN(cid:2)+1
the ith
f can be understood to encapsulate layernorm,
residual connections, and the feedforward sub-
layer of a standard transformer (Vaswani et al.,
2017). H(cid:2)
i is given to f to allow residual connec-
zioni. As mentioned in §4.3, f can also encapsulate
the value function for each head.
4.5 Transformer Encoder
Finally, we define a transformer of depth d as a
cascade of d transformer layers:
transformer). UN
Definition 11
(p-precision
p-precision transformer over alphabet Σ is a pair
consisting of a p-precision position embedding
function11 φ : Σ × N → {0, 1}p and a d-tuple of
p-precision transformer layers (cid:14)L1, . . . , Ld(cid:15).
For a position embedding function φ and w ∈
Σn, let φ(w) be the position-wise broadcasted
embedding of w: for 1 ≤ i ≤ n, φi(w) (cid:2) φ(wi, io).
Definition 12 (Transformer computation). A transformer (φ, ⟨L_1, · · · , L_d⟩) computes the following function of a string w ∈ Σ∗:

T(w) = (L_d ◦ L_{d−1} ◦ · · · ◦ L_1)(φ(w)).
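Putting Definitions 9–12 together, a transformer is a composition of layers, each of which combines its heads position-wise. A minimal sketch (illustrative code; transformer_layer and run_transformer are not names from the paper, and the heads follow the attention head sketch in §4.3):

```python
# Illustrative sketch of Definitions 9-12: a layer runs its k heads in
# parallel and combines them per position with the activation f; the
# transformer composes d layers over the broadcast embedding phi.

def transformer_layer(heads, f, h):
    """heads: list of k functions mapping the whole sequence h to per-position
    outputs; f: activation combining the k head outputs and the residual h_i."""
    a = [head(h) for head in heads]                        # k heads in parallel
    return [f([a[m][i] for m in range(len(heads))], h[i])
            for i in range(len(h))]

def run_transformer(phi, layers, w):
    """layers: list of (heads, f) pairs; phi(symbol, i) is the embedding."""
    h = [phi(w[i - 1], i) for i in range(1, len(w) + 1)]   # broadcast phi(w)
    for heads, f in layers:                                # apply L_1, ..., L_d
        h = transformer_layer(heads, f, h)
    return h
```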
We will use n to denote the length of w, E
take the transformer’s depth d to be fixed w.r.t. N.
The input to the transformer can thus be
represented with N = n log |Σ| bits using a bi-
nary encoding for the vocabulary. The circuits
in subsequent sections to simu-
we construct
late transformers will also have input size N .
We will assume transformers have log-precision
relative to the size of the input, specifically,
O(log N )-precision. Since |Σ| is fixed (typically
30000 in practice), we will think in terms of
O(log n)-precision. Thus, by Definition 6, all of
the intermediate functions of such transformers
are computable in O(log n) space and output (at
most) these many bits. Note that this is enough
precision to represent positional encodings and for
each position to point to a constant number of other
values, but not enough precision for non-lossy
pooling of the entire input into a single value.
11To apply the normal notion of p-precision to inputs
outside {0, 1}∗, we imagine elements of Σ are encoded as
integers ≤ |Σ| in binary, and natural numbers are represented
as integers ≤ n. Thus, we assume log |Σ| + log n ≤ p.
Relationship to Practical Transformers. Our
log-precision transformers do not enforce that
S (Definition 7) and f (Definition 9) follow
the transformer structure. Tuttavia, a feedfor-
ward net whose primitive operations (per esempio., scalar
multiplication) are defined over O(log n)-size
numbers can be computed in O(log n) space. Thus, bounded-precision practical transformers
are a special case of our log-precision transform-
ers. This makes our setup appropriate for proving
upper bounds on transformers, which is our main
contribution.
5 Log-Precision Transformers as
Non-Uniform Threshold Circuits
We first show that log-precision transformers can
be simulated by non-uniform threshold circuits,
before presenting the more technical uniform ver-
sion of the results in §6. The initial non-uniform
result extends the findings of Merrill et al. (2022),
who showed that saturated attention transform-
ers12 can be simulated in TC0. Here, we remove
the simplifying saturated attention assumption and
other restrictions on the underlying datatype. In-
stead, we show that our log-precision assumption
is enough to prove that a transformer can be
simulated in TC0 with any attention function.
Hao et al. observed that any Boolean function
of O(log n) bits can be computed by a poly(N)
size circuit. We extend this to m-bit outputs,
which is both more convenient and more efficient
than constructing m separate Boolean circuits:
Lemma 1 (Extended from Hao et al., 2022). Let
F : {0, 1}∗ → {0, 1}m be a function. For all
c ∈ R+ and n ∈ N, there exists an AND/OR
circuit of size at most nc + c log n + m and depth
3 that computes f on inputs of size c log n.
Proof. Like Hao et al. (2022), we construct a
circuit using a DNF representation of f on inputs
of size c log n, except we use a combined DNF
representation for all output bits of f . The DNF
formula has at most 2^{c log n} = n^c terms. The circuit
has a NOT gate for each input bit, an AND
gate for each DNF term, E, for each of the m
output bits, an OR gate combining the outputs
of those AND gates (cioè., DNF terms) for which
that bit is 1.
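A minimal sketch of this Lemma 1 construction (illustrative; the depth-3 circuit is represented abstractly by its combined DNF table rather than by explicit gates):

```python
from itertools import product

# Illustrative sketch of Lemma 1: for f on m_in = c*log n input bits with
# m_out output bits, every assignment x with f_k(x) = 1 becomes an AND term
# feeding the OR gate for output bit k. There are at most 2^{m_in} = n^c terms.

def combined_dnf(f, m_in, m_out):
    """Return, per output bit k, the list of assignments (AND terms) where it is 1."""
    terms = [[] for _ in range(m_out)]
    for x in product([0, 1], repeat=m_in):
        y = f(x)
        for k in range(m_out):
            if y[k] == 1:
                terms[k].append(x)
    return terms

def evaluate(terms, x):
    """Output bit k is the OR over its AND terms (negations live on the inputs)."""
    return [int(any(all(x[j] == t[j] for j in range(len(x))) for t in ts))
            for ts in terms]
```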
12Saturated attention is uniform attention over a subset of
the prior layer nodes.
We now use Lemma 1 to prove the following
non-uniform result. We note that the proof goes
through even if the notion of p-precision (Defini-
zione 6) is relaxed to not require computability in
space p. This requirement will, Tuttavia, become
important for our subsequent result in §6.
Theorem 1 (Non-uniform). Any c log n-precision
depth-d transformer operating on inputs in Σ^n can
be simulated by a threshold circuit family of depth
3 + (9 + 2d⊕)D.
IL
input of
Proof. Let w ∈ Σn be
UN
c log n-precision transformer. Noi
show by
induction that we can construct a composition
of constant-depth, poly-size threshold circuits to
compute each layer of this transformer. Così, any
constant-depth transformer will be computable by
a constant-depth threshold circuit.
In the base case of layer 0 and token i, we construct gates representing the constant i encoded in binary. We can then compute h^0_i = φ(w_i, i) using Lemma 1, yielding a poly-size depth-3 circuit.
In the inductive case of computing layer h^{ℓ+1}_i for 1 ≤ ℓ + 1 ≤ d, we note that each vector output of layer h^ℓ_i has size (at most) c log n bits because of the log-precision assumption.
We first fix a head a^{ℓ+1}_{ih} (Definition 8) to simulate. Applying Lemma 1, we can compute s(h^ℓ_i, h^ℓ_j) with a poly-size depth-3 circuit, in parallel for all j. Since n floats with c log n precision can be approximately added in TC0 (§A), we can construct a TC0 circuit of depth d⊕ to compute Z_i. Since s(h^ℓ_i, h^ℓ_j), Z_i, and h^ℓ_i all have c log n bits, we can compute s(h^ℓ_i, h^ℓ_j)/Z_i · h^ℓ_j with a poly-size depth-3 circuit;13 we do this in parallel for all j. Next, we again use the fact that approximate addition of n floats is in TC0 to compute a^{ℓ+1}_{ih} as the approximate sum over j with a depth-d⊕ circuit.
We now simulate a layer h^{ℓ+1}_i (Definition 10) in terms of its constituent heads. Since all arguments of f have size c log n, we apply Lemma 1 to compute f with a poly-size depth-3 circuit, yielding h^{ℓ+1}_i. We repeat this in parallel for all i. This completes the inductive step needed to compute all values in the (ℓ + 1)-st layer with a circuit depth of 9 + 2d⊕.
Aggregating the circuit over all d layers, IL
overall circuit depth is 3 + (9 + 2d⊕)D.
13This may seem counterintuitive since multiplication of
two n-precision numbers is outside AC0. Tuttavia, here we
leverage the fact that the precision is c log n.
Corollary 1.1 (Non-uniform). Any log-precision
transformer can be simulated by a non-uniform
TC0 circuit family.14
6 Log-Precision Transformers as
Uniform Threshold Circuits
We will now extend the argument from the last
section to show that O(log n)-precision transform-
ers can be simulated by uniform constant-depth
threshold circuits by capitalizing on the assump-
tion that φ, S, and f are log-precision, and thus can
be computed in O(log n) space. The overall proof
idea is similar, but due to the uniformity condition,
the proof becomes substantially more technical.
We must not just show the existence of a threshold
circuit family computing a transformer, but also
show that this circuit family can be generated by
a log-space Turing machine.
We first extend Lemma 1 to respect uniformity:
Lemma 2. Let f : {0, 1}∗ → {0, 1}^m be a linear-space computable function. There exists a Turing machine that, for all n ∈ N and c ∈ R^+, uses at most c log n + log m space to map input 1^n to a circuit of size at most n^c + c log n + m
and depth 3 that computes f on inputs of size at
most c log n.
Proof. We give the proof in the form of an algo-
rithm to construct a circuit as a function of n and
then justify its correctness and space complexity.
Algorithm. We first print 2c log n nodes
representing unnegated and negated input nodes.15
Now, we need to show how to construct nodes
corresponding to n^c DNF terms. To this end, we
loop over all possible inputs x ∈ {0, 1}c log n by
maintaining the c log n bit binary representation
of x (initialized with 0^{c log n}) and incrementing it
by 1 at each step of the loop. We create a new ∧
node i with c log n arguments, defined as follows.
For j ∈ [c log n], we create an argument pointer
A (unnegated) node j if xj = 1 and to (negated)
node c log n + j otherwise.
14Here, a TC0 circuit family is a constant-depth, poly-size
circuit family computing some function {0, 1}∗ → {0, 1}∗.
While we define TC0 for decision problems in Definition 4,
it is standard and well-defined to extend the same term
to refer to circuit families computing functions as well
(Hesse, 2001).
15We ignore the initial unnegated input nodes when
considering the size of the circuit.
Now, we construct nodes computing each of the
m output nodes. We loop over k ∈ [M], construct-
ing a single node for each k. We loop over all
x ∈ {0, 1}c log n analogously above to construct
a list of arguments. By our linear-space com-
putability assumption and because x has c log n
bits, we can compute f (X) as a subroutine in
O(log n)-space to obtain fk(X). If fk(X) = 1, we
print node 2c log n + j as an argument of node k.
Correctness. We show that this Turing machine
maps input n to a serialized circuit computing f
on inputs of size n. The first layer simply produces
unnegated and negated input values. The second
layer then produces all possible DNF terms. Finally,
node k of the third layer computes the disjunction
over all terms x such that fk(x) = 1. Thus, node
k of the third layer computes fk.
Log Space. To complete the proof, we justify
that M uses O(log n+log m) spazio. Looping over
x ∈ {0, 1}c log n is accomplished by treating x as
a binary number initialized to 0 and incrementing
it at each step. Thus, the loop pointer for building
the DNF terms takes c log n space to store. For
building the m output nodes, we maintain a similar
loop pointer as well as an index k ≤ m, taking
c log n + log m space. Thus, the overall algorithm
uses c log n + log m space.
Thus, M uses c log n + log m space to map 1n
to a circuit of size at most nc + c log n + m and
depth 3 that computes f on size c log n inputs.
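A sketch of this generator (illustrative; the emitted node-description format is hypothetical and not the serialization grammar of §3) makes the space accounting concrete, since the loops below mirror the counters that the Turing machine stores:

```python
# Illustrative sketch of the Lemma 2 generator: it prints a depth-3 circuit
# for f while enumerating assignments with a counter; the Turing machine only
# needs to keep the c*log n-bit counter and the output index k in memory.

def print_dnf_circuit(f, m_in, m_out, emit=print):
    # Layer 1: unnegated and negated copies of each input bit.
    for j in range(m_in):
        emit(f"in_{j}")
        emit(f"not_in_{j} = NOT in_{j}")
    # Layer 2: one AND term per assignment x, enumerated by incrementing x.
    for x in range(2 ** m_in):
        lits = [f"in_{j}" if (x >> j) & 1 else f"not_in_{j}" for j in range(m_in)]
        emit(f"term_{x} = AND " + " ".join(lits))
    # Layer 3: one OR gate per output bit k; f is re-run as a subroutine to
    # decide, assignment by assignment, which terms to include as arguments.
    for k in range(m_out):
        args = [f"term_{x}" for x in range(2 ** m_in)
                if f([(x >> j) & 1 for j in range(m_in)])[k] == 1]
        emit(f"out_{k} = OR " + " ".join(args))
```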
We can leverage this lemma to derive the
uniform analog of Theorem 1, as follows.
Theorem 2 (Uniform, main result). Any c log n-precision depth-d transformer operating on inputs in Σ^n can be simulated by a logspace-uniform threshold circuit family of depth 3 + (9 + 2d⊕)d.
Proof. We will provide a proof by induction over
transformer layers ℓ that there is a Turing machine
M operating in O(log n) space that, on input 1n,
outputs a circuit that simulates the transformer’s
computation on inputs of size n. This circuit is
identical to the one in the proof of Theorem 1, E
thus has the same circuit depth.
In the base case, we use log space to track a
counter maintaining the current token i (between
1 and n) throughout the circuit construction. We
construct gates encoding the constant i in binary.
We can then apply Lemma 2 to construct a Tur-
ing machine that maps 1n to a constant-depth
threshold circuit computing h^0_i = φ(w_i, i).
In the inductive case, we assume we can output
in O(log n) space a circuit computing every value
H(cid:2)
in the previous layer (cid:3). We will show that
io
we can, in O(log n) spazio, now output a circuit
computing every value in layer (cid:3) + 1.
As in Theorem 1, we first fix a head a^{ℓ+1}_{ih} to simulate. Recall (Definition 8) that

a^{ℓ+1}_{ih} = Σ_{j=1}^{n} ( s(h^ℓ_i, h^ℓ_j) / Z_i ) · h^ℓ_j.
io, H(cid:2)
By Lemma 2, we can generate a depth-3 circuit
of size at most z = nc(cid:17)
+ C(cid:17) log n + 1, Dove
C(cid:17) = 2c (since the input to f is of size 2c log n)
that computes s(H(cid:2)
j) for specific i, j. We do
this sequentially for 1 ≤ j ≤ n and 1 ≤ h ≤ k,
padding each circuit with unused nodes so that
each one has size exactly z, and the z-th node
corresponds to the output. Così, the indices of
the output nodes for each of the columns will be
w^ℓ + z(jk + h) for 1 ≤ j ≤ n, where w^ℓ is the index of the last output node h^ℓ_n of the previous
layer.
At this point, we use the fact that for p = c log n,
the p-precision approximate sum of n p-precision
numbers can be computed by a uniform threshold
circuit (§A). We can thus use a Turing machine as a sub-routine to generate, on input 1^n, k threshold circuits, where each has size z′ and computes an ⊕ gate over n items of precision p each. We set the inputs of circuit h to be nodes w^ℓ + z(jk + h) for 1 ≤ j ≤ n. By construction, this yields the normalizing constants Z_i = Σ_{j=1}^{n} s(h^ℓ_i, h^ℓ_j), whose value is located at the node at index w^ℓ + znk + z′h for head h.
Using p-precision arithmetic operator circuits, we can now also generate a circuit to compute s(h^ℓ_i, h^ℓ_j)/Z_i · h^ℓ_j for each 1 ≤ j ≤ n and 1 ≤ h ≤ k, by using index w^ℓ + z(jk + h) as before for the value of s(h^ℓ_i, h^ℓ_j) and index w^ℓ + znk + z′h for the normalizing constant Z_i of head h. Here too we use circuits of identical size z′′, making w^ℓ + k(zn + z′ + z′′i) the index of the output nodes of these n circuits. Next, we again employ a ⊕ circuit of size z′, similar to the computation of Z_i, to compute the sum of these n values. Finally, we compute h^{ℓ+1}_i by applying f via Lemma 2.
Note that this requires keeping only ℓ, i, and n
in memory, each of which takes O(log n) bits.
We repeat this process for all 1 ≤ i ≤ n to
compute the entire (ℓ + 1)-st layer, which finishes the inductive step: if we can output a circuit computing layer ℓ in O(log n) space, then we can do the same for layer ℓ + 1.
Because the depth derived in Theorem 2 È
constant with respect to n, it follows that:
Corollary 2.1 (Uniform, main result). Any
log-precision transformer can be simulated by
a uniform TC0 circuit family.
7 Lower Bounds for Instruction
Following and Advice Transformers
So far, we have shown that uniform TC0 is an
upper bound for log-precision transformers. Is
this upper bound tight, cioè., also a lower bound?
While we do not answer this question here, we
address a related question as a first step: we con-
struct a transformer that can evaluate TC0 circuits
on binary inputs, showing that transformers can
compute any TC0 function when their input is
augmented with the right ‘‘instructions’’.
More formally, we consider the Circuit Value
Problem (CVP) (Ladner, 1975), also referred to
as the Circuit Evaluation Problem, where the input
is a Boolean circuit C and a string x ∈ {0, 1}N, E
the task is to return the value of C(X) ∈ {0, 1}.
This problem is known to be complete for the
class P under AC0 reductions (Ladner, 1975). We
will assume C is serialized as described in §3 and
prove that log-precision transformers can evaluate
any TC0 circuit. Note that this is an extension of
the typical CVP since the circuit has threshold
gates, not just standard AND/OR gates.
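Concretely, an instance of this extended CVP pairs a serialized threshold circuit with an input bit string; the formatting below is only an illustrative assumption, since the exact encoding of ⟨C, x⟩ is not fixed here:

```python
# A hypothetical rendering of a CVP instance <C, x> as one token sequence:
# the serialized threshold circuit from §3 followed by the input bits.
# (Spaces are for readability; per footnote 7 they are ignored.)

circuit_serialization = "X X <= 00 &1 >= 10 & &11"   # theta_{>=1}(x1, NOT x2)
input_bits = "10"                                     # x1 = 1, x2 = 0
cvp_instance = circuit_serialization + " " + input_bits
# The task is to return C(x). Here C(x) = 1, since x1 = 1 already satisfies
# the >= 1 threshold of the output gate.
```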
It is known that LSTMs cannot evaluate Bool-
ean formulae (Merrill, 2020), a special case of the
CVP. In contrast, we show that transformers can.
To demonstrate the practicality of our lower
bound construction, we will not just prove the
existence of transformers that can evaluate TC0
circuits but also specify concrete choices for the
positional embedding scheme and the class of
attention functions that are sufficient to do so.
Fractional Positional Embeddings. For a vector x and scalar y, let ⟨x, y⟩ be the vector appending y onto x.16 For σ ∈ Σ, let v(σ) be the one-hot encoding of σ into R^{|Σ|}. For w ∈ Σ∗, we define the fractional positional embedding at token i as

φ(w_i, i) = ⟨v(w_i), i/n⟩.

16I.e., ⟨x, y⟩_i = x_i for 1 ≤ i ≤ |x|, and y if i = |x| + 1.
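A minimal sketch of this embedding (illustrative; it assumes Σ is provided as an ordered list of symbols):

```python
# Illustrative sketch of the fractional positional embedding: one-hot encode
# the symbol and append the fractional position i/n.

def fractional_embedding(w, sigma):
    """w: input sequence; sigma: the alphabet Σ as an ordered list of symbols."""
    n = len(w)
    def phi(symbol, i):                      # i is 1-indexed, as in the paper
        one_hot = [1.0 if s == symbol else 0.0 for s in sigma]
        return one_hot + [i / n]
    return [phi(w[i - 1], i) for i in range(1, n + 1)]
```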
io , H(cid:2)
Saturated Attention. We imagine f (H(cid:2)
j) È
computed via saturated attention (cf. Merrill et al.,
2022), which provides a simple model of the types
of attention we can expect
to be learned in
transformers (Merrill et al., 2021). Primo, queries
are computed as qi = Qh(cid:2)
io, and then keys
kj = Kh(cid:2)
j. Define the dot-product attention
score σij = q(cid:18)
i kj. We can then define saturated
attention as
(cid:7)
S(H(cid:2)
io, H(cid:2)
j) =
1 if σij = maxk σik
0 otherwise.
After normalization, saturated attention creates a
distribution that is uniform over a subset of posi-
zioni. Così, it is capable of parameterizing hard
Attenzione, uniform attention over the full sequence,
and various attention patterns in between.
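A minimal sketch of saturated attention (illustrative; Q and K denote the head's query and key matrices, and matvec is a hypothetical helper):

```python
# Illustrative sketch of saturated attention: similarity is 1 exactly on the
# positions that maximize the dot-product score, so after normalization the
# attention is uniform over that subset of positions.

def saturated_attention_scores(h_i, h_all, Q, K):
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    q_i = matvec(Q, h_i)
    scores = [sum(q * k for q, k in zip(q_i, matvec(K, h_j))) for h_j in h_all]
    best = max(scores)
    return [1.0 if s == best else 0.0 for s in scores]
```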
Simple Pooling Functions. For simplicity, we
assume pooling functions f are thresholded linear
functions of their inputs. Thus, they could be
implemented by a feedforward neural net. Without
loss of generality, we let attention heads have
a value function, which can be folded into the
pooling function from the last layer (see §4).
Now, we are ready to present the main result.
Our construction below is specific to the circuit
serialization scheme discussed in §3, but can be
extended to other serializations as well.
Lemma 3. For all d, there exists a transformer
with fractional positional embeddings, saturated
Attenzione, thresholded linear pooling functions,
and depth 2d that, for any threshold circuit C of
depth d, maps input (cid:14)C, X(cid:15) to the value C(X).
Proof. We will construct a pair of two transformer
layers that evaluate all the nodes at depth ℓ in the threshold circuit, for any ℓ. It follows that a
transformer of depth 2d can compute the value
C(X).
We refer to a token of type X as an input node.
Similarly, we call a token of type Dir a gate node. Finally, we call a token of type & an argument.
Base Case: Input Nodes. We construct one at-
tention layer that attends uniformly over all
positions whose value returns 1 if wi = X and
0 otherwise. Thus, this head computes #(X)/n,
Dove #(X) is the number of occurrences of X in
w. We then define a second layer that, at input
node i, retrieves the token at position j defined by
j = (1 − #(X) + i)/n.

j is the index of the ith input value. Thus, after this layer, each input node i stores its value x_i.
In the base case, we also construct an attention
head that, at the ith gate node, counts the fraction
of nodes (out of n) that are gate nodes to the left
of the current node. Thus, this head computes i/n.
Also at gate node i, we construct an attention head
that counts the fraction of nodes to its right before
the next & node that have value 1. This head thus
has value ki/mi where ki is the threshold value
of the i-th gate and mi is its arity. We apply the
same construction at each argument & to count the
1’s that follow until the next non-1 symbol.
Finally, using the first attention layer, we have each 1 node attend to the first argument symbol & to its left and retrieve its index j/n. Then, in
the second attention layer, each argument attends
uniformly over all nodes with values j/n. The net
effect is for each argument node to store j/n, i.e., the pointer it is encoding in unary as &1^j.
Inductive Case: Gate Nodes. By our inductive
assumption over prior layers, all tokens corre-
sponding to circuit nodes at depth ≤ ℓ contain
their appropriate value. We now construct 2 trans-
former layers to evaluate gate nodes at depth
ℓ + 1.
In the first attention layer, each argument j
attends to the closest gate node i to its left,
which is the gate it belongs to. Recall from the
base case that argument j already stores j/n. Each
argument &1^j attends with position key j/n to
gate node j and retrieves its value in the previous
layer.
The second attention layer applies at gate nodes,
not arguments. At gate i of arity mi, we set the
attention s(io, j) to indicate whether argument j
belongs to gate node i, which holds for exactly m_i
arguments. We set the attention value to be the
binary value of the referent of argument j. Così,
the attention head computes ci/mi, where ci is
the number of arguments of node i that are 1. We
repeat this for all gate nodes.
At this point, for the i-th gate node, we have
computed both ci/mi and ki/mi. Thresholding
(ci − ki)/mi at 0 allows us to decide, based on
whether Dir is <= or >=, whether the current gate
node should output a 0 or a 1. Repeating this for all
gates at layer ℓ + 1 completes the inductive step:
We can evaluate all gate nodes in this layer.
Theorem 3. Depth-2d transformers can solve
CVP for depth-d TC0 circuits.
7.1 Instruction Following
CVP is closely related to instruction learning
(Brown et al., 2020) and instruction following
compiti (Finlayson et al., 2022). The latter task
setup provides a transformer two inputs: a regular
expression r as an ‘‘instruction’’, and z ∈ {0, 1}∗.
The goal of the task is to return whether z belongs
to the regular language represented by r. Viewed
from this lens, the circuit evaluation setup asks:
Can transformers follow instructions provided in
the form of a circuit? As discussed below, our
result says the answer is yes for all constant
depth threshold circuits. This, to the best of our
knowledge, provides the first non-trivial lower
bound for transformers in the instruction learning
setting.
Formally, an instruction I is any description17
of a function fI of {0, 1}∗. We say a trans-
former correctly follows an instruction I if, for
all x ∈ {0, 1}∗, it correctly computes fI (X) SU
input (cid:14)IO, X(cid:15). A non-uniform instruction descrip-
tion is a family of length-specific descriptions
{In}∞
n=1. We say a transformer correctly follows
a non-uniform instruction family {In} if, for all n
and all x ∈ {0, 1}N, it correctly computes fI (X) SU
input (cid:14)In, X(cid:15). The non-uniform description {In}
may take any form. When it forms a TC0 circuit
family, we refer to it as a TC0 instruction description. Since Theorem 3 constructs a transformer
that can evaluate any TC0 circuit, it follows that:
Corollary 3.1. There exists a depth-2d trans-
former that can correctly follow any depth-d TC0
instruction description.
Thus, transformers with simple position em-
beddings, Attenzione, and pooling functions can
simulate any instruction provided in the form of
a TC0 circuit. We note that while it is unknown
whether the class of regular languages, considered
by Finlayson et al. (2022), is contained in TC0,
the other side is known: There are problems com-
putable by TC0 circuits that are not computable
17Formally, a function description is a fixed-size program
to compute that function under some model of computation.
by a regular language. These include problems in-
volving counting and arithmetic, which are beyond
regular languages. Our results thus expand the
known kinds of instructions transformers are able
to follow, at least with hand-constructed weights.
7.2 Advice Transformers
We can also view circuit evaluation abilities
of transformers (Lemma 3) from the lens of
advice-taking Turing machines which, in addition to their usual input, are also provided an input-length-dependent (but input-independent) advice string. For instance, P/poly is the class of problems decidable in polynomial time when the Turing machine is given an advice string of size polynomial in the input length (cf. Arora and Barak, 2009).
In the same vein, let T/poly be the class of
log-precision, constant-depth transformers with
polynomial advice strings. In other words, SU
an input of size n, we allow the transformer to
receive an additional poly(N) bits of input that
cannot depend on the standard input. Now let
{Cn}∞
n=1 be a circuit family demonstrating that a
problem is in non-uniform TC0. Then, by passing
the description of Cn as advice for input length n,
it immediately follows from Lemma 3 that advice
transformers can simulate non-uniform TC0:
Corollary 3.2. Non-uniform TC0 ⊆ T/poly.
Since non-uniform TC0 even contains some
undecidable languages (Arora and Barak, 2009,
Claim 6.8), T/poly is clearly a very powerful class
and a strict superset of T, the class of decision
problems recognized by transformers (which are
all decidable). Thus, a problem in T/poly cannot
always be solved by a transformer on its own.
However, if given a description of how to do so
(‘‘advice’’) in the form of a TC0 circuit, our result
shows that a transformer could solve that problem.
8 Conclusion
Answering two open questions from Merrill et al.
(2022), we prove log-precision transformers with
any (including soft) attention can be simulated
by uniform constant-depth threshold circuits. This
establishes thresholded addition as a fundamen-
tal operation for understanding the computational
model of transformers: Any log-precision trans-
former can be re-expressed as a polynomial
number of threshold gates stacked to a constant
depth. This result also establishes potential lim-
its on the computational power of log-precision
transformers; per esempio., if L ⊂ P, transformers cannot
compute all poly-time functions. They are cer-
tainly very far from being universal. The intuition
at the heart of this result is that forcing a model
to be highly parallelizable likely sacrifices its ex-
pressiveness. Since parallelism seems essential to
pretraining any massive model at scale, any large
language model—transformer or otherwise—may
suffer from a similar tradeoff.
Acknowledgments
The authors are grateful for the valuable feedback
from the anonymous reviewers and the TACL
action editor. They also thank Paul Beame and
colleagues at AI2 including Kyle Richardson,
Michal Guerquin, Peter Clark, Tushar Khot, E
especially Matthew Finlayson, whose empirical
findings about instruction learning inspired §7.
Feedback from Sam Bowman, Arya McCarthy,
Roma Patel, and Lena Strobl, and discussions
with the FLaNN, ML for Code (MILA), E
Foundations of Language Processing (Umeå) Research groups helped improve earlier drafts. The
authors also appreciate Rahul Santhanam’s feed-
back. This work was funded in part by NSF award
1922658. William Merrill was supported by an
NSF graduate research fellowship and by AI2.
References
Eric Allender. 1999. The permanent requires large
uniform threshold circuits. Chicago Journal of
Theoretical Computer Science.
Sanjeev Arora and Boaz Barak. 2009. Com-
putational Complexity: A Modern Approach.
Cambridge University Press. https://doi.org/10.1017/CBO9780511804090
Armin Biere, Marijn Heule, Hans van Maaren, and
Toby Walsh. 2009. Handbook of Satisfiability:
Volume 185 Frontiers in Artificial Intelligence
and Applications. IOS Press.
Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. 2020. Language models
are few-shot learners. In Advances in Neural
Information Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.
Tom Bylander. 1991. Complexity results for planning. In Proceedings of the International Joint Conference on Artificial Intelligence. https://doi.org/10.1016/B978-0-08-049944-4.50008-2
Andrew Chiu, George I. Davida, and Bruce E.
Litow. 2001. Division in logspace-uniform NC1. RAIRO Theoretical Informatics and Applications, 35:259–275. https://doi.org/10.1051/ita:2001119
Mostafa Dehghani, Stephan Gouws, Oriol
Vinyals, Jakob Uszkoreit, and Lukasz Kaiser.
2019. Universal transformers. In International
Conference on Learning Representations.
Tim Dettmers, Mike Lewis,
and Luke
Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix
multiplication for transformers at scale. In
Advances in Neural Information Processing
Sistemi.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Nelson Elhage, Neel Nanda, Catherine Olsson,
Tom Henighan, Nicholas Joseph, Ben Mann,
Amanda Askell, Yuntao Bai, Anna Chen,
Tom Conerly, Nova DasSarma, Dawn Drain,
Deep Ganguli, Zac Hatfield-Dodds, Danny
Hernandez, Andy Jones, Jackson Kernion,
Liane Lovitt, Kamal Ndousse, Dario Amodei,
Tom Brown, Jack Clark, Jared Kaplan, Sam
McCandlish, and Chris Olah. 2021. A math-
ematical framework for transformer circuits.
Transformer Circuits Thread.
Matthew Finlayson, Kyle Richardson, Ashish
Sabharwal, and Peter Clark. 2022. What makes
instruction learning hard? An investigation and
a new challenge in a synthetic environment. In
Proceedings of the 2022 Conference on Empir-
ical Methods in Natural Language Processing.
Raymond Greenlaw, James M. Hoover, and
Walter L. Ruzzo. 1991. A compendium of
problems complete for P. Technical Report
TR91-11, University of Alberta.
Michael Hahn. 2020. Theoretical limitations of
self-attention in neural sequence models. Trans-
actions of the Association for Computational
Linguistica, 8:156–171. https://doi.org
/10.1162/tacl_a_00306
Yiding Hao, Dana Angluin, and Robert Frank.
2022. Formal language recognition by hard at-
tention transformers: Perspectives from circuit
complexity. Transactions of the Association
for Computational Linguistics, 10:800–810.
https://doi.org/10.1162/tacl_a_00490
William Hesse. 2001. Division is in uniform TC0. In International Colloquium on Automata, Languages, and Programming, pages 104–114. https://doi.org/10.1007/3-540-48224-5_9
Neil Immerman. 2012. Descriptive Complexity.
Springer Science & Business Media.
Neil D. Jones and William T. Laaser. 1976. Complete problems for deterministic polynomial time. Theoretical Computer Science, 3(1):105–117. https://doi.org/10.1016/0304-3975(76)90068-2
Richard E. Ladner. 1975. The circuit value prob-
lem is log space complete for P. ACM SIGACT
News, 7(1):18–20. https://doi.org/10
.1145/990518.990519
Harry R. Lewis and Christos H. Papadimitriou.
1982. Symmetric space-bounded computation.
Theoretical Computer Science, 19:161–187.
https://doi.org/10.1016/0304-3975(82)90058-5
William Merrill, Vivek Ramanujan, Yoav
Goldberg, Roy Schwartz, and Noah A. Smith.
2021. Effects of parameter norm growth during
transformer training: Inductive bias from gra-
dient descent. In Proceedings of the 2021
Conference on Empirical Methods in Natural
Language Processing. https://doi.org
/10.18653/v1/2021.emnlp-main.133
William Cooper Merrill. 2020. On the linguistic
capacity of real-time counter automata. ArXiv,
abs/2004.06866.
William Cooper Merrill, Ashish Sabharwal,
and Noah A. Smith. 2022. Saturated trans-
formers are constant-depth threshold circuits.
Transactions of the Association for Compu-
tational Linguistics, 10:843–856. https://doi.org/10.1162/tacl_a_00493
Jorge P´erez,
Javier Marinkovi´c, and Pablo
Barcel´o. 2019. On the Turing completeness
of modern neural network architectures. In
International Conference on Learning Repre-
sentations.
Colin Raffel, Noam M. Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140).
Omer Reingold. 2008. Undirected connectivity in
log-space. Journal of the ACM, 55:17:1–17:24.
https://doi.org/10.1145/1391289
.1391291
Leslie G. Valiant. 1979. The complexity of com-
puting the permanent. Theoretical Computer
Science, 8:189–201. https://doi.org/10.1016/0304-3975(79)90044-6
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, volume 30.
Curran Associates, Inc.
A Iterated p-Precision Float Addition
We interpret a p-bit string x as a p-precision float
by taking the first p/2 bits18 of x as a signed in-
teger m encoding the mantissa and the remaining
p/2 bits of x as another signed integer e encod-
ing the exponent. A float with mantissa m and exponent e, denoted ⟨m, e⟩, encodes m · 2^e.
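As a small, self-contained illustration of this encoding (our own sketch; the paper does not fix a particular signed-integer representation, so a sign-magnitude reading is assumed here), a p-bit string can be decoded into ⟨m, e⟩ and its value as follows:

def decode_float(bits: str) -> tuple:
    """Split a p-bit 0/1 string into signed halves <m, e>; the value is m * 2**e.

    Assumes a sign-magnitude encoding of the two p/2-bit signed integers,
    which is one possible instantiation of the convention above.
    """
    p = len(bits)
    assert p % 2 == 0  # footnote 18: w.l.o.g. p is even

    def signed(b: str) -> int:
        magnitude = int(b[1:], 2) if len(b) > 1 else 0
        return -magnitude if b[0] == "1" else magnitude

    return signed(bits[: p // 2]), signed(bits[p // 2:])

m, e = decode_float("01011011")  # p = 8: m = +5, e = -3
value = m * 2 ** e               # 5 * 2**(-3) = 0.625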
Computing the sum of n n-bit integers (known as iterated addition or simply summation) is well-known to be in uniform TC0 (Hesse, 2001; Chiu et al., 2001). We leverage this fact to show that the same holds for the sum of n O(log n)-precision floats. A subtlety of adding p-precision floats is that their sum can require more than p bits to represent precisely as a float. For instance, while each of 2^r and 1 is representable with a (signed) mantissa of only 2 bits, their exact sum, 2^r + 1, requires a mantissa of r + 1 bits. Hence, p-precision transformers must sacrifice some precision when performing summation.

18 We assume w.l.o.g. that p is even.
We define float addition by mapping the floats to integers, adding the integers exactly, and then mapping the sum back to a float (with possible loss of precision). Let I^max_q = 2^(q−1) − 1 be the greatest q-bit signed integer, and I^min_q = −I^max_q. Let F^max_p be the greatest value representable by a p-precision float. Since the exponent of a float φ can be negative and represent a fraction, we rescale φ by 2^(−I^min_{p/2}) when mapping it to an integer g_p(φ):
Definition 13. The integer mapping of a p-bit float φ = ⟨m, e⟩ is defined as g_p(φ) = m · 2^(e − I^min_{p/2}).
Definition 14. The p-truncated float mapping of an integer z is defined as f_p(z) = ⟨m, e⟩, where^19

m = rshift(z, max{0, sizeof(z) − p/2})
e = sizeof(z) − sizeof(m) + I^min_{p/2}

when e ≤ I^max_{p/2}; otherwise (i.e., when z > F^max_p), we set m = e = I^max_{p/2} to properly handle overflow.
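The definitions above translate almost line for line into the following rough Python model (ours, for intuition only: unbounded Python integers stand in for bit strings, and the sign handling in rshift is one reasonable reading of the sign convention; it is not the circuit construction itself):

def sizeof(x: int) -> int:
    """Footnote 19: sizeof(x) = floor(log2 |x|) + 2 for x != 0, and sizeof(0) = 2."""
    return 2 if x == 0 else abs(x).bit_length() + 1

def rshift(x: int, y: int) -> int:
    """Right-shift the magnitude of x by y bits, keeping the sign."""
    sign = -1 if x < 0 else 1
    return sign * (abs(x) >> y)

def i_max(q: int) -> int:
    """Greatest q-bit signed integer, 2^(q-1) - 1; i_min(q) is its negation."""
    return 2 ** (q - 1) - 1

def i_min(q: int) -> int:
    return -i_max(q)

def g(p: int, m: int, e: int) -> int:
    """Definition 13: g_p(<m, e>) = m * 2^(e - I^min_{p/2})."""
    return m * 2 ** (e - i_min(p // 2))

def f(p: int, z: int) -> tuple:
    """Definition 14: the p-truncated float mapping f_p(z) = <m, e>."""
    m = rshift(z, max(0, sizeof(z) - p // 2))
    e = sizeof(z) - sizeof(m) + i_min(p // 2)
    if e > i_max(p // 2):  # overflow: z exceeds the largest representable value
        m = e = i_max(p // 2)
    return m, e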
Definition 15 (Iterated p-precision float addition). We define the sum of k p-precision floats as

⊕_{i=1}^{k} φ_i = f_p( Σ_{i=1}^{k} g_p(φ_i) ).
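Continuing the same sketch (with the hypothetical helpers g and f from above), Definition 15 amounts to mapping every float to an integer, summing exactly, and truncating once at the end:

def float_sum(p: int, floats) -> tuple:
    """Definition 15: the p-precision sum (+)_i phi_i = f_p(sum_i g_p(phi_i))."""
    return f(p, sum(g(p, m, e) for m, e in floats))

# For example, with p = 8, float_sum(8, [(3, 0), (1, 0)]) == (4, 0), i.e., 4 * 2**0.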
We first verify that Definition 14 closely approximates exact addition.

Lemma 4. Let φ = ⟨m, e⟩ be a float such that |φ| ≤ F^max_p and e ≥ I^min_{p/2}. Then φ and f_p(g_p(φ)) differ by a factor of at most 1 ± 2^(−p/2+2).
P
Proof. Let z = gp(φ), which is well-defined be-
cause of the precondition e ≥ I min
p/2 of the lemma.
Let φ(cid:17) = (cid:14)M(cid:17), e(cid:17)(cid:15) = fp(z).
19For x (cid:2)= 0, sizeof(X) = (cid:21)log |X|(cid:22) + 2; sizeof(0) = 2.
For y ≥ 0, rshift(X, sì) right-shifts x by y bits.
First consider the easy case where sizeof(z) ≤ p/2. Then m′ = z and e′ = I^min_{p/2} from Definition 14. Since z = m · 2^(e − I^min_{p/2}) by Definition 13, it follows that φ and φ′ have exactly the same value.

Now assume sizeof(z) > p/2. It follows from the precondition |φ| ≤ F^max_p of the lemma that there is no overflow when applying Definition 14 to compute ⟨m′, e′⟩. Thus m′ consists of the p/2 highest-order bits (including the sign bit) of z, and e′ = ℓ + I^min_{p/2}, where ℓ = sizeof(z) − p/2 is the number of bits truncated from z to obtain m′. Let δ denote the (non-negative) integer formed by the ℓ lowest-order bits of z that are truncated. Then δ ≤ 2^ℓ − 1 = 2^(sizeof(z)−p/2) − 1 < z · 2^(−p/2+2).

Recall that the value of φ is g_p(φ) · 2^(I^min_{p/2}) = z · 2^(I^min_{p/2}). By the above argument, we also have that the value of φ′ is within (z ± δ) · 2^(I^min_{p/2}), which is within z · (1 ± 2^(−p/2+2)) · 2^(I^min_{p/2}). Thus, φ and φ′ are within a factor of 1 ± 2^(−p/2+2) of each other. □
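As a quick empirical sanity check of this bound (again with the hypothetical helpers g and f sketched after Definition 14; exhaustive only for the small parameters shown, and restricted to positive mantissas since the bound is symmetric in sign), one can enumerate floats satisfying the lemma's preconditions and compare each against its round trip through g_p and f_p:

def check_lemma4(p: int) -> None:
    """Check |f_p(g_p(phi)) / phi - 1| <= 2^(-p/2 + 2) over small positive floats."""
    q = p // 2
    bound = 2 ** (-q + 2)
    f_max = i_max(q) * 2 ** i_max(q)         # greatest p-precision float value
    for e in range(i_min(q), i_max(q) + 1):  # e >= I^min_{p/2}, as the lemma requires
        for m in range(1, 2 ** (2 * q)):     # includes mantissas wider than p/2 bits
            if m * 2 ** e > f_max:           # enforce |phi| <= F^max_p
                break
            m2, e2 = f(p, g(p, m, e))
            assert abs((m2 * 2 ** e2) / (m * 2 ** e) - 1) <= bound

check_lemma4(8)  # passes silently for this small case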
Finally, we show that, with log precision, computing ⊕ (Definition 15) is in uniform TC0.

Lemma 5. Let p ≤ c log n and φ = ⊕_{i=1}^{k} φ_i, where k ≤ n and each φ_i is p-precision. Then φ is computable by a constant-depth uniform threshold circuit of size poly(n).
Proof. Let N = c log n + 2n^c. We first use Lemma 1 to map each φ_i = ⟨m_i, e_i⟩ to the integer z_i = m_i · 2^(e_i − I^min_{p/2}), which has size sizeof(m_i) + (e_i − I^min_{p/2}) ≤ p/2 + 2 · 2^(p/2) ≤ c log n + 2n^c = N. For 1 ≤ i ≤ k, we pad z_i to N bits, and for k < i ≤ N, we create an N-bit integer z_i = 0. We can then compute z = Σ_{i=1}^{k} z_i with a constant-depth uniform threshold circuit of size poly(N) using the classical construction to sum N N-bit integers (cf. Immerman, 2012, exercise 5.29). The size of this circuit is also polynomial in n by the definition of N. Finally, we compute f_p(z) using a constant-depth AND/OR circuit. □