Saturated Transformers are Constant-Depth Threshold Circuits


William Merrill∗† Ashish Sabharwal∗ Noah A. Smith∗‡
∗Allen Institute for AI, USA †New York University, USA ‡University of Washington, USA
willm@nyu.edu {ashishs, noah}@allenai.org

Abstract

Transformers have become a standard neural
network architecture for many NLP problems,
motivating theoretical analysis of their power
in terms of formal languages. Recent work
has shown that transformers with hard atten-
tion are quite limited in power (Hahn, 2020),
as they can be simulated by constant-depth
AND/OR circuits (Hao et al., 2022). However,
hard attention is a strong assumption, which
may complicate the relevance of these results
in practice. In this work, we analyze the circuit
complexity of transformers with saturated at-
tention: a generalization of hard attention that
more closely captures the attention patterns
learnable in practical transformers. We first
show that saturated transformers transcend the
known limitations of hard-attention transform-
ers. We then prove saturated transformers
with floating-point values can be simulated by
constant-depth threshold circuits, giving the
class TC0 as an upper bound on the formal
languages they recognize.

1 Introduction

Opening the ‘‘black box’’ (Alishahi et al., 2020)
of the representations within neural networks is
an important step towards building systems with
robust and interpretable behavior. In NLP, one
part of this question is analyzing the languages that
networks can model, and the mechanisms they use
to represent linguistic structure and dependencies.
One path toward this goal is via formal analysis
of specific network architectures (Merrill, 2021);
for example, recurrent neural networks (RNNs).
Due to their autoregressive formulation, formal
linguistic analysis of RNNs has often character-
ized their power by relating them to automata-
theoretic classes of formal languages (Weiss et al.,
2018; Peng et al., 2018; Merrill, 2019, inter alia).
Recently, however, RNNs have largely been over-
taken in NLP by a new class of models: transform-
ers (Vaswani et al., 2017). Transformers are not
autoregressive, and therefore less naturally resem-
ble automata, posing challenges to characterizing
their linguistic capacity or inductive biases in the
same terms as RNNs. Instead, some recent work
has related them to circuit complexity classes, a
direction that we continue to pursue in this paper.
Drawing on classical circuit lower bound results,
Hao et al. (2022) and Hahn (2020) derive theo-
retical limitations of transformers with hard at-
tention, meaning the attention distributions focus
all probability mass on one index. Together, their
results show that AC0—the class of languages
recognizable by constant-depth circuit families—
upper bounds the formal languages hard-attention
transformers can recognize.

However, hard attention is a strong assumption,
making it unclear how these results transfer to
practical transformers. For example, Bhattamishra
et al. (2020) showed how transformers can solve
synthetic counting tasks by using uniform atten-
tion patterns, which hard attention does not allow.
Motivated by this potential disconnect between
theory and practice, we aim to extend circuit-
based analysis to transformers with saturated at-
tention: a generalization of hard attention that has
been argued to approximate attention patterns ac-
quired through gradient descent (Merrill et al.,
2021). Broadly speaking, saturated attention goes
beyond hard attention in that it can ''tie'' across a
subset of positions, rather than selecting just one
position. The tied positions are then aggregated
by averaging. Qualitatively, saturated attention
heads can ''count'': a capability observed in trans-
formers in practice (Bhattamishra et al., 2020).
Further, Merrill et al. (2021) show that trans-
former training dynamics lead attention heads in
several pretrained transformers to approximate
saturated attention. In summary, saturated atten-
tion strictly generalizes hard attention and should
more closely reflect the attention patterns ac-
quired in practical transformers.

Our main contributions are twofold. First, we
show that saturated transformers can recognize
languages outside AC0. Then, as depicted in
Table 1, we prove that transformers with floating

Transactions of the Association for Computational Linguistics, vol. 10, pp. 843–856, 2022. https://doi.org/10.1162/tacl_a_00493
Action Editor: Mark Johnson. Submission batch: 9/2021; Revision batch: 2/2022; Published 8/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC BY 4.0 license.

              Float (F)   Rational (Q)
Hard (η)      ⊆ AC0       ⊆ AC0
Saturated (ζ) ⊆ TC0       ⊆ ALL

Table 1: Summary of combined results from
Hao et al. (2022) and this paper. Each cell (α, D)
characterizes the languages recognizable by trans-
formers with attention function α and datatype D
(floats F or rationals Q). AC0 and TC0 are circuit
complexity classes, and AC0 ⊂ TC0. ALL is the
set of all formal languages over alphabet {0, 1}.
See §4 for formal definitions. Out of these re-
sults, we view saturated attention with floats as
the best model of practical transformers.

point activations and saturated attention can only
recognize formal languages in the circuit com-
plexity class TC0, constituting an upper bound for
a more realistic model of transformers than past
results with hard attention.

2 Roadmap

In §3, we formally define our model of the trans-
former, including defining saturated attention in
contrast to hard attention. §4 introduces circuits
in theoretical computer science and relevant com-
plexity measures and classes for them.

In §5, we first briefly analyze saturated trans-
formers with rational values where the embedding,
scoring, and activation functions are allowed to
be any size-preserving function. We find such
transformers to be universally powerful. We also
observe that when the positional embeddings are
computed in time linear in the sequence length,
saturated rational-valued transformers are exactly
as powerful as the complexity class of their acti-
vation functions, because the full input sequence
can be pooled to a single position, and an acti-
vation function can be used as an oracle over the
full input sequence. However, this setup relies on
the use of unrealistic embedding functions. To
move to a more realistic model of computation,
we then focus on saturated transformers whose
values are restricted to be floats, which have a
coarser granularity and, therefore, cannot encode the
full input sequence into a single position.

Building on results of Pérez et al. (2019), we
demonstrate in §6 that saturated transformers with
float activations transcend the theoretical limita-
tions of hard-attention transformers. In particular,
we will show that they can recognize the majority
language, which lies outside AC0. We experimen-
tally validate that transformers can learn to recog-
nize the majority language. Taken together, these
results suggest that the very weak characterization
of hard-attention transformers does not hold in
practice for saturated or soft attention.

In §7, we show that, on input sequences of length
n, the size of each state vector in a transformer
over floats is O(log n) bits, similar to saturated
LSTMs (cf. Merrill, 2019). Thus, the full trans-
former state at any given layer has size O(n log n),
although each feedforward block can only locally
access a small, O(log n) ''piece''. Hence, while
hierarchical representations can be implemented
in a transformer (e.g., to process arbitrary-depth
Dyck languages or reverse strings as in Weiss et al.
[2021]), our result implies that they must be dis-
tributed in some way across n state vectors, rather
than represented compactly within a single vector.
Finally, in §8, we use the bounded size of trans-
former representations to upper bound the for-
mal languages that can be recognized by saturated
transformers with floating-point values. In partic-
ular, we show that such transformers can be sim-
ulated by constant-depth threshold circuits, and
thus only recognize languages in TC0. Informally,
this suggests that moving from hard attention to
saturated attention can be thought of as extend-
ing the implicit class of circuit gates available in
the network to include threshold gates.

Our results make progress in the analysis of
transformers by deriving upper bounds for a more
realistic model of transformers than has previ-
ously been analyzed. RoBERTa, T5, and other
pretrained transformers have been shown to be ap-
proximately saturated (Merrill et al., 2021), so our
results imply that TC0 may be a meaningful up-
per bound on the computation expressible within
such networks. Our analysis also motivates future
work further refining the circuit characterization
of saturated transformers, as well as comparing
transformers with soft and saturated attention.

3 Definitions and Notation

We will often use w to refer to a string over any
generic alphabet Σ, that is, w ∈ Σ∗. Semantically,
w corresponds to the string a transformer receives
as input. In contrast, we use x and other symbols
to refer to binary strings in {0, 1}∗. These binary
strings will represent intermediate values within
the transformer computation, rather than the raw
input to the transformer.

3.1 Datatypes

Under our model, all values in the transformer are
binary strings. In order to compute self attention
and other operations over binary strings, we need
to define datatypes describing the semantics of
these binary strings as numbers. We will describe
a semantics for binary strings as integers, as often
comes up in circuit complexity. We then extend
this to rational numbers and floats, which are
necessary for representing the division operations
that occur in attention heads within transformers.

Unsigned Integers  We can interpret binary
strings x ∈ {0, 1}∗ as unsigned integers in the
standard way, namely, the numerical value of
x ∈ {0, 1}^n is

  [[x]]_Z = Σ_{i=1}^{n} 2^{i−1} x_i.

We allow standard integer operations like +_Z
and ·_Z.

Definition 5 (Language recognition). The trans-
former recognizes a language L ⊆ Σ∗ iff a linear
classifier applied to v_{L,1}, the first value in the
last layer, outputs a value > 0 ⇐⇒ w ∈ L.

This says that the decision problem of recog-
nizing L must be linearly separable using the first
value in the last layer of the transformer. In prac-
tice, the first token in a transformer is often set to
CLS, and its output can be passed to a classifier
during finetuning (Devlin et al., 2019). This in-
spires Definition 5. There are other potential ways
to define language recognition and generation for
transformers (Hewitt et al., 2020; Yao et al., 2021),
but they do not lead to meaningful differences for
our purposes.

Finally, we define AHAT(D) as the set of
languages recognizable by some saturated trans-
former over D, where the internal functions can
be any size-preserving function.4

Definition 6. Let AHAT(D) be the set of lan-
guages L such that there exists a transformer
⟨Σ, D, ζ, L, H, φ, sℓ,h, fℓ⟩ that recognizes L where
each φ, sℓ,h, fℓ ∈ P.5

We note that size preservation is a weak con-
dition to assume about the internal functions in
practical transformers: Because any linear-time-
computable function is size-preserving, it is
strictly weaker than assuming that the internal
functions can be computed in linear time. To fur-
ther justify this condition, we explicitly show in
Appendix B that the component functions within
transformers are size-preserving.

4 Circuit Complexity

Circuit complexity is a branch of computational
complexity theory that studies circuit families as
a model of computation.6 Intuitively, circuits are
useful for formally studying the types of computa-
tional problems that can be efficiently solved with
parallelism, as the depth of a circuit corresponds
to the runtime of a program on an idealized, fully
parallel computer. We review background on cir-

4The name AHAT stands for ''averaging hard attention

transformer’’, and is taken from Hao et al. (2022).

5To apply size preservation to the embedding function φ,

we consider the size of a token to be log(|Σ|).

6For more reference material on circuit complexity, we
refer the reader to chapters 6 and 14 of Arora and Barak (2009)
or chapters 1 and 2 of the Handbook of Theoretical Computer
Science, Volume A (van Emde Boas, 1991; Johnson, 1991).

Figure 1: A circuit that takes a string x ∈ {0, 1}5 and
returns whether it contains the bigram 11.

cuits, circuit families, and relevant complexity
measures and classes.

Circuits For a fixed n, a circuit is a computation
graph, where leaves correspond to input bits xi
and their negations ¬xi, and the internal nodes are
logic gates (typically ∧ and ∨), with one labeled as
the output node. The gates can conventionally be
taken to have either binary or unbounded fan-in.
The circuit computes a function f : {0, 1}n →
{0, 1} by substituting the input values into the leaf
nodes, propagating the computation through the
graph, and returning the value of the output node.
Figure 1 shows an example circuit that takes inputs
of length 5, and returns whether they contain the
bigram 11.
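To make the picture concrete, the following minimal Python sketch evaluates the same function that the circuit in Figure 1 computes; the function name and list-based input handling are our own illustration, not part of the paper.

```python
# Illustrative sketch (not from the paper): the bigram-11 circuit of Figure 1
# as a Python function. It is an unbounded fan-in OR over the AND of each
# adjacent pair of input bits, so its depth is constant regardless of n.
def contains_bigram_11(x):
    """Evaluate the circuit on a bitstring x (the paper's example uses length 5)."""
    assert all(b in (0, 1) for b in x)
    # Layer 1: AND gates over adjacent pairs; Layer 2: one OR gate (the output node).
    return int(any(x[i] and x[i + 1] for i in range(len(x) - 1)))

print(contains_bigram_11([0, 1, 1, 0, 0]))  # 1: contains the bigram 11
print(contains_bigram_11([1, 0, 1, 0, 1]))  # 0: no adjacent pair of 1s
```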

Circuit Families A circuit family is an ordered
set of circuits {Cn}n∈N where each circuit is
identified with a particular input size n. We say
a circuit family recognizes a formal language
L ⊆ {0, 1}∗ iff, for all w ∈ {0, 1}∗,7

C|w|(w) = 1 ⇐⇒ w ∈ L.

Circuit Complexity Two important notions of
complexity for a circuit are its size and depth. The
size of a circuit is the number of gates. The depth
is the longest path from an input node to the output
node. For a circuit family, both quantities can be
expressed as functions of the input size n. A circuit
complexity class is a set of formal languages that
can be recognized by circuit families of a certain
size, depth, and set of gates. In particular, we will
discuss the classes AC0 and TC0.

Definition 7. AC0 is the set of languages L ⊆
{0, 1}∗ such that there exists a circuit family
recognizing L with unbounded arity {∧, ∨} gates,
poly(n) size, and O(1) depth.

7Similarly, for any alphabet Σ and L ⊆ Σ∗, we interpret
wi as a one-hot vector over Σ and define the family to
recognize L iff, for all w ∈ Σ∗, C|w|·|Σ|(w) = 1 ⇐⇒ w ∈ L.

Intuitively, AC0 represents the class of prob-
lems that are highly parallelizable when the com-
putational primitives are standard logic gates. In
contrast, TC0 will also represent highly paral-
lelizable computation, but when the gates are ex-
panded to include threshold gates.

For a bitstring x ∈ {0, 1}∗, define the thresh-
old gate θ≥k(x) to return 1 iff ≥ k bits in x
are 1, and analogously for ≤ k. For example,
θ≥3(110011) = 1.
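The following small Python sketch (our own illustration; the helper names are not from the paper) shows the semantics of these threshold gates, and how a single unbounded fan-in gate already decides majority.

```python
# Minimal sketch of unbounded fan-in threshold gates, the extra primitive
# that separates TC0 from AC0. Helper names are ours, not the paper's.
def theta_ge(k, bits):
    """theta_{>=k}(x): return 1 iff at least k of the bits are 1."""
    return int(sum(bits) >= k)

def theta_le(k, bits):
    """theta_{<=k}(x): return 1 iff at most k of the bits are 1."""
    return int(sum(bits) <= k)

print(theta_ge(3, [1, 1, 0, 0, 1, 1]))      # 1, matching theta_{>=3}(110011) = 1
bits = [1, 0, 1, 1, 0]
print(theta_ge(len(bits) // 2 + 1, bits))   # one gate decides majority of the string
```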

Definition 8. TC0 is the set of languages L ⊆
{0, 1}∗ such that there exists a circuit family rec-
ognizing L with unbounded arity {∧, ∨, θ} gates,
poly(n) size, and O(1) depth.

It is known that AC0 ⊂ TC0 ⊆ NC1, where NC1
denotes the languages recognizable by O(log n)-
depth circuits with bounded gate arity. Whether
or not the latter containment between TC0 and
NC1 is strict is an open question. Whereas parity
and other basic regular languages are outside AC0
(Furst et al., 1981), TC0 properly contains parity,
although it is unknown whether it contains all
the regular languages. Between AC0 and TC0 lies
the class ACC0 (Yao, 1990).

Uniformity The circuit classes defined above
(and which we will use in this paper) are non-
uniform, meaning circuits for different input sizes
are not constrained to have any relation to each
andere. Non-uniform circuit families can recognize
some uncomputable languages, such as the lan-
guage of strings 1k such that Turing machine k
does not halt on the null input (cf. Arora and Barak,
2009). In contrast, the uniform variants of cir-
cuit families are constrained such that a log-space
Turing machine must output a string encoding
of circuit Cn on the input string 1n, forcing any
language the circuit family can recognize to be
computable. For these uniform classes (which we
write with a u prefix), it is known that

uTC0 ⊆ uNC1 ⊆ L ⊆ P,

where L and P denote the conventional complexity
classes of log-space and polynomial-time decision
problems. Thus, it is unknown whether uTC0 is
restricted compared to general polynomial-time
Berechnung, but if we accept the common con-
jecture that one (if not all) of the above con-
tainments are strict, then uTC0 forms a restricted
family of problems compared to P, which, intui-
tively, are more parallelizable than other prob-
lems in P.

5 Aren’t Transformers Universal?

We now begin our analysis of saturated trans-
formers. Hao et al. (2022) and Hahn (2020) were
able to give upper bounds on the power of hard
attention without imposing any constraints on the
embedding, scoring, and activation functions. The
same will not be the case with saturated attention:
any bounds on transformers will require lever-
aging some properties constraining their internal
functions. One property we use will be size preser-
vation. We will first show, though, that size pre-
servation is not enough on its own: Deriving
a nontrivial upper bound will depend on subtle
assumptions about the transformer’s datatype.

With rational values and size-preserving inter-
nal functions, we will show saturated transformers
can recognize any formal language, namely, the
class ALL = {L | L ⊆ {0, 1}∗}. Our construction
resembles the universal approximation construc-
tion of Yun et al. (2020), which relies on the ability
of the transformer to uniquely encode the full in-
put string into a single value vector. After the
full sequence is encoded locally into a single vec-
tor, the activation block can be used as a black box
to recognize any language.
Theorem 1. AHAT(Q) = ALL.

Proof. We construct a 1-layer rational-valued
transformer with a single head to recognize every
string w in any formal language L ∈ ALL. We will
omit ℓ, h subscripts. Let pi denote the ith prime
number. The embedding layer encodes each input
token according to

  φ(wi, i) = 1/pi if wi = 1, and 0 otherwise.

Since pi ∼ i log i for large i by the prime number
theorem (cf. Goldstein, 1973), the number of bits
needed to represent the denominator of φ(wi, i) is

  ≤ c log(i log i) ≤ c log(i^2) = 2c log i.

Since i has size log i, this implies φ is size-
preserving.

Now, we define a single uniform attention head
that sums across all i, outputting

  Σ_{wi=1} 1/pi.

The denominator q of this sum is the product
Π_{wi=1} pi. Observe that wi = 1 iff pi divides q.
Thus, we
can define a function f′ that extracts the input
sequence w from q by checking whether, for each
i, pi divides q. We let g be a function recognizing
L, and set f = g ◦ f′. The output of the transformer
will now compute whether w ∈ L, since f′ out-
puts an encoding of the original input sequence w,
and g decides whether w ∈ L. Note that any func-
tion solving a decision problem is size-preserving,
hence f ∈ P.
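The following Python sketch (our own illustration over exact Fractions, with hypothetical helper names) walks through the same prime-encoding idea; because Python's Fraction automatically reduces, the sketch models the uniform head as an average and rescales by n before decoding.

```python
# Sketch of the Theorem 1 construction over exact rationals: embed with
# reciprocal primes, pool with one uniform head, then recover w from the
# denominator of the pooled value.
from fractions import Fraction

def nth_prime(k):
    """Return the k-th prime (1-indexed), by trial division; fine for a demo."""
    count, cand = 0, 1
    while count < k:
        cand += 1
        if all(cand % d for d in range(2, int(cand ** 0.5) + 1)):
            count += 1
    return cand

def embed(w):
    # phi(w_i, i) = 1/p_i if w_i = 1, else 0
    return [Fraction(1, nth_prime(i + 1)) if b else Fraction(0) for i, b in enumerate(w)]

def uniform_head(values):
    return sum(values, Fraction(0)) / len(values)   # saturated attention with all scores tied

def decode(head_output, n):
    q = (head_output * n).denominator               # q = product of p_i over positions with w_i = 1
    return [1 if q % nth_prime(i + 1) == 0 else 0 for i in range(n)]

w = [1, 0, 1, 1, 0]
assert decode(uniform_head(embed(w)), len(w)) == w  # one value determines the whole input
```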

Theorem 1 says that our transformer architec-
ture parameterized with a rational datatype can
recognize any formal language. But a construc-
tion of this form feels unrealistic for two reasons.
First, it requires the embedding layer to imple-
ment an unconventional prime-based encoding
scheme. Second, we are using the activation
layer as a black box to recognize any
language—even uncomputable ones! On the other
hand, the feedforward subnetworks used in prac-
tice in transformers cannot even implement all
computable functions when the weights are fixed
independent of the sequence length n. We can get
around both these issues by instead restricting the
datatype to floats, which is the direction we will
pursue in the remaining sections.8

5.1 Resource-Bounded Transformers

In Appendix C, we develop an alternate perspec-
tive on the universality of transformers, showing
Das, if the embedding function is allowed to be
computed in time linear in the sequence length,
then the transformer’s complexity is equivalent to
its activation functions’ complexity.

Theorem 2 (Informal). If φ can be any function
computable in time linear in n, and the scoring
and activation functions can be computed in T (m)
time on inputs of size m with T (m) ≥ m, then the
languages recognizable by the transformer are
exactly TIME(T (m)).

Appendix C contains a formal statement and
proof. For example, allowing polynomial-time
functions inside the transformer implies that the
transformer will recognize exactly the complexity
class P. A major source of unrealism in this setup is
the assumption that φ can be an arbitrary function

8It may also be possible to derive tighter bounds for
rational-valued transformers by imposing stronger constraints
on the internal functions. However, with floats, we will see
that size preservation is sufficient to derive a tighter char-
acterization of transformers’ power. We leave this alternate
direction to future work.

computable in time linear in n, motivating our
main results in a more constrained setting in §8.

5.2 Discussion

We are not stating the results in this section as
evidence that practical transformers are capable
of universal or arbitrary polynomial computation.
Rather, the unnaturalness of these constructions
(specifically, the prime-number-based position
encoding) motivates us to slightly constrain our
model of the transformer in a realistic way: we
will switch the datatype from rationals to floats,
because even using only simple uniform attention,
a model with rationals and unconstrained internal
functions is universal. We will soon see that this
realistic constraint prevents universal simulation,
and in fact bounds the capacity of the saturated
transformer within TC0.

6 Beyond Hard Attention, with Floats

We now move to the setting of saturated trans-
formers over floats. Hao et al. (2022) identified
that hard-attention transformers can only recog-
nize languages within AC0. In contrast, saturated
transformers over floats can recognize the ''ma-
jority'' language MAJ, which is known to lie
outside AC0 (Furst et al., 1981). Pérez et al. (2019,
Prop. 3.3) show how MAJ can be recognized by
transformers. In Theorem 3, we offer a simpler
construction that leverages only a single uniform
attention head, as opposed to the model of trans-
formers they were considering. Thus, this con-
struction is achievable with saturated attention.

Theorem 3. AHAT(F) ⊈ AC0.

Proof. Let #σ(w) ∈ N denote the number of σ
tokens in string w ∈ {0, 1}∗. Let #(w) denote a
count vector where each element corresponds to
some σ ∈ {0, 1}. We define MAJ as follows:
  MAJ = {w ∈ {0, 1}+ | #1(w) > #0(w)}.

We will construct a 1-layer transformer with a
single head to recognize MAJ, omitting ℓ, h sub-
scripts from s, f, x, b. Figure 2 gives the same
construction in RASP (Weiss et al., 2021).

Let xi = φ(wi, ich) be a 1-hot encoding of wi.
For all i, J, set s(xi, xj) = 1, resulting in a single
head attending everywhere:

  bi = #(w)/n.


Figure 2: A program recognizing MAJ in RASP, a
programming language designed to abstract away de-
tails of transformer computation (Weiss et al., 2021).
frac0 and frac1 measure the fraction of inputs that are 0
or 1. Then maj checks whether frac1 > frac0.

Finally, set f (bi) to return whether #1(w)/n >
#0(w)/n, which, for n > 0, is true iff
w ∈ MAJ.
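A minimal Python sketch of this construction (our own simplification, with plain arithmetic standing in for the transformer's parameterized layers):

```python
# Sketch: a single saturated, uniform attention head recognizing MAJ.
# With all scores tied, saturated attention averages the one-hot value
# vectors, so the head output at every position is the count vector #(w)/n.
def recognize_maj(w):
    n = len(w)
    values = [(1 - b, b) for b in w]       # x_i = one-hot encoding of w_i in {0, 1}
    frac0 = sum(v[0] for v in values) / n  # uniform attention: average of all values
    frac1 = sum(v[1] for v in values) / n
    return frac1 > frac0                   # f(b_i): compare the two fractions

print(recognize_maj([1, 1, 0, 1, 0]))  # True:  three 1s vs two 0s
print(recognize_maj([1, 0, 0]))        # False: one 1 vs two 0s
```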

Notably, the construction in Theorem 3 is not
just possible within our generalized transformer
framework, but can also be implemented by the
standard parameterization of φ, s, and f in real
transformers (Vaswani et al., 2017). The uniform
attention pattern can be implemented by setting all
query and key attention parameters to 0. Then, we
can use the affine transformation that aggregates
the head outputs to compute the tuple:

  ⟨ (#1(w) − #0(w))/n , 0 ⟩.

This tuple is then passed through layer normal-
ization (Ba et al., 2016), resulting in a new tuple
⟨t1, t2⟩. Crucially, t1 > t2 iff the same applies to
the quantities in the original tuple. Thus, a linear
classifier can decide whether t1 > t2 to success-
fully recognize the language, as per Definition 5.

6.1 Empirical Validation

In Figure 3, we show empirically that a 1-layer
transformer can learn and generalize MAJ. This
supports our argument that the theoretical limita-
tions of hard-attention transformers do not apply
to practical transformers. We train with three dif-
ferent types of positional encoding: none, meaning
no positional information; learned, where each po-
sition gets a trainable embedding vector; and the
sinusoidal scheme of Vaswani et al. (2017). The
model with no positional embeddings generalizes
the best, followed by the learned embeddings.
It appears that while MAJ is in the capacity of the
transformer, the standard sinusoidal positional em-
bedding scheme provides the wrong inductive bias
for learning it. This recalls the finding of Yao et al.
(2021) that the choice of positional encodings
seems to greatly impact the transformer's abil-
ity to generalize formal language tasks to longer
sequences.

Figure 3: In practice, transformers can learn the major-
ity language (which lies outside AC0). We train 1-layer
transformers on majority, where each line represents a
different positional encoding scheme. Training string
length was binomial with n = 100. Trained mod-
els were then evaluated on generalization sets with n
ranging from 100 to 500. Mean length (x axis) is n/2.

7 Size of Transformer Values

The theoretical limits on hard-attention transform-
ers were derived by Hao et al. (2022) by bounding
the size in bits of the representation vℓ,i at each
layer ℓ and position i. Specifically, they show that
the value vℓ,i is representable in O(log n) bits on
input sequences of length n. Thus, each value can
only contain limited information about the input
sequence, intuitively explaining their upper bound
on hard-attention transformers. Inspired by their
analysis, this section will show that, in a saturated
transformer, each vℓ,i also has a size of O(log n)
bits. Later in §8, we will use this property to show
that saturated attention transformers are limited in
the formal languages they can recognize.

7.1 Size of Float Sums

How many bits does it take to represent the value of
an attention head within a saturated transformer?
As a naive bound, the output of a saturated at-
tention head is specified by a float for each of n
values attended over from the last layer, which
would take at least linearly many bits in n. How-
ever, this upper bound on its size is not tight.
Instead, we will show that all head and activation
values can be represented in O(log n) bits. Our
analysis will rely heavily on the following lemma:

Lemma 1. Let v1, · · · , vn be a sequence of floats,
each with size at most z. Then there exists c such
that Σ_{i=1}^{n} vi has size at most 4cz + 2 log n + 1.

Proof. Let pi, qi denote the numerator and de-
nominator of the floating point vi, respectively.
Similarly, let ps, qs be the numerator and denom-
inator of the float s = Σ_{i=1}^{n} vi. By assumption,
there exists c such that each pi, qi both have size
≤ cz for large enough n. We let pmax = maxi pi
and analogously for qmax. Because all qi's are
powers of 2, the numerator ps is

  Σ_{i=1}^{n} pi · qmax/qi ≤ n pmax qmax,

which, represented as an integer, has size

  ≤ log n + cz + cz = 2cz + log n.

On the other hand, the denominator qs = qmax,
which has size ≤ z. Therefore, the float repre-
senting the sum has size

  ≤ 1 + 2 max(2cz + log n, z)
  = 4cz + 2 log n + 1,

which completes the proof.

In particular, we will use Lemma 1 to show
Das, when each of a sequence of n values has size
Ö(log n), the sum will also have size O(log n).

7.2 Size of Transformer Values

We will now leverage Lemma 1 to show that the
values are of bounded size in any transformer
over floats with an elementwise-size-preserving
attention function.

Definition 9. A function α : Dn → Dn is
elementwise-size-preserving if, for 1 ≤ i ≤ n,
the function xi ↦ α(x)i is size-preserving (where
x ∈ Dn).

Note that saturated attention satisfies this defi-
nition. We are ready to prove a theorem bounding
the size of the representations in transformers with
elementwise-size-preserving attention.

Theorem 4. For any transformer over F with φ,
sℓ,h, fℓ ∈ P and α elementwise-size-preserving,
for all ℓ ≤ L and i ≤ n, vℓ,i has size O(log n).

Proof. By induction over ℓ. The proof follows the
definition of transformer computation in §3.2.

Base Case  wi ∈ Σ has size O(1), and i ∈ [n]
has size O(log n). Since φ ∈ P, v0,i = φ(wi, i)
has size O(log n) for all i.

Inductive Case  Assume vℓ,i has size O(log n).
Since sℓ+1,h ∈ P, aℓ+1,h,i,j = sℓ+1,h(vℓ,i, vℓ,j)
has size O(log n) for all i, j. Since α is
elementwise-size-preserving, we can conclude
that α(aℓ+1,h,i,:)j also has size O(log n) for all
h, i, j. Multiplying two floats is size-preserving
(cf. Appendix B), so α(aℓ+1,h,i,:)j · vℓ,j has size
O(log n) for all h, i, j. We then apply Lemma 1
to conclude that bℓ+1,h,i has size O(log n),
where, recall,

  bℓ+1,h,i = Σ_{j=1}^{n} α(aℓ+1,h,i,:)j · vℓ,j.

Finally, computing vℓ+1,i = fℓ+1(vℓ,i, bℓ+1,:,i), we
conclude that vℓ+1,i has size O(log n) for all i
by size preservation.

Corollary 4.1. For any saturated transformer
over F with size-preserving internal functions, for
all ℓ ≤ L and i ≤ n, vℓ,i has size O(log n).

Corollary 4.1 follows because saturated at-
tention is elementwise-size-preserving. Softmax
attention, on the other hand, is not guaranteed
to fulfill this property, because it requires com-
puting the exponential function. This technical
challenge prevents generalizing our technique to
soft attention.

7.3 Discussion

Similar to hard-attention transformers (Hao et al.,
2022), the size of each vector representation in
a saturated transformer over floats is O(log n).
This is enough memory for individual vectors to
''count'', a behavior that has been observed in
both LSTMs (Weiss et al., 2018) and transformers
(Bhattamishra et al., 2020). On the other hand,
O(log n) space is not enough memory for indi-
vidual vectors (for example, the CLS output) to
encode arbitrarily large combinatorial objects like
trees. However, transformers are not limited to
computing in an ''online'' fashion where tokens
are consumed sequentially, meaning that their ef-
fective state is n values of size O(log n). Notably,
trees with n leaves can be encoded in a distributed
fashion across n values of size O(log n). One
construction for this is, at index i, to store wi and
ich, along with a pointer j to the parent. Since i, J
can both be represented in log n bits, each vector
uses only O(log n) bits.
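A small Python sketch of this distributed layout (the example tree and record format are our own illustration, not the paper's):

```python
# Sketch of the distributed encoding described above: each position i stores
# only (w_i, i, parent_i), which is O(log n) bits, yet the n records jointly
# determine the whole tree structure.
tokens = ["The", "cat", "sleeps"]
parents = [1, 2, -1]          # token 0 attaches to 1, 1 attaches to 2, 2 is the root
records = [(w, i, j) for i, (w, j) in enumerate(zip(tokens, parents))]

def children(records, j):
    """Recover structure from the distributed encoding: children of position j."""
    return [i for (_, i, parent) in records if parent == j]

print(records)               # [('The', 0, 1), ('cat', 1, 2), ('sleeps', 2, -1)]
print(children(records, 2))  # [1]
```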

Additionally, the O(log n) space bound has
implications from the perspective of circuit com-
plexity. While saturated attention cannot be sim-
ulated in AC0, we will show in §8 that saturated
transformers over F can be simulated by TC0
circuits.

8 Threshold Circuit Simulation

We have proved that each value vector in a sat-
urated transformer over floats has O(log n) size.
Now, we show how this implies saturated trans-
formers can be simulated by TC0 circuits. Our
results heavily leverage the following lemmas:

Lemma 2 (Hao et al. 2022). Any function f :
{0, 1}^c → {0, 1}^d can be computed by a Boolean
circuit of depth 3 and size at most (2^c + c + 1)d.

So that our results are self-contained, we re-
produce a proof of this lemma in Appendix D.
Applying Lemma 2 to a size-preserving function
with at most c log n input bits immediately yields:

Corollary 2.1. Any size-preserving function with
at most c log n input bits can be computed by a
Boolean circuit of depth 3 and polynomial size.

In other words, such functions can be computed
with AC0 circuits. Additionally, we will show that
the sum of n floats of size at most c log n can be
computed by TC0 circuits.

Lemma 3. Let v1, · · · , vn be a sequence of floats,
each with size at most c log n for some c. Then the
sum Σ_{i=1}^{n} vi is computable by a threshold circuit
of constant depth and polynomial size.

Proof. Let the integers pi, qi be the numerator
and denominator of vi. We first compute qmax, the
maximum qi, using an AC0 circuit that compares
all pairs qi, qj, and returns the first qi such that
qi ≥ qj for all j. We then use the fact that
multiplication and right shift (qi is a power of 2)
are in TC0, in order to compute

  ri = pi · qmax/qi

in parallel for all i. Note that qmax and qi are both
powers of 2, so the division will be exact. Next,
we leverage the fact that the sum of n integers of
size O(log n) is in TC0 (Kayal, 2015), in order
to compute the numerator of the sum p′ = Σ_i ri.
We select the denominator as q′ = qmax. Finally,
we can add an AC0 circuit that ''reduces'' the
fraction by removing shared trailing zeros from
p′, q′, which is possible by Corollary 2.1. Thus,
we have constructed a TC0 circuit to compute
the sum of n floats with size O(log n).
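The arithmetic behind this construction can be sketched in ordinary Python (plain integers stand in for the circuit gadgets; the helper name is ours):

```python
# Sketch of the summation trick in the proof of Lemma 3. Floats are pairs
# (p, q) with q a power of two, so rescaling every numerator to the common
# denominator q_max only needs multiplication and an exact right shift.
def sum_floats(floats):
    """floats: list of (numerator p_i, power-of-two denominator q_i)."""
    q_max = max(q for _, q in floats)
    # r_i = p_i * q_max / q_i  (exact, since q_max and q_i are powers of two)
    numer = sum(p * (q_max // q) for p, q in floats)   # sum of n small ints: in TC0
    denom = q_max
    # "reduce" the fraction by stripping shared trailing zero bits
    while numer % 2 == 0 and denom % 2 == 0 and denom > 1:
        numer //= 2
        denom //= 2
    return numer, denom

# 3/4 + 5/2 + 1/8 = 27/8
print(sum_floats([(3, 4), (5, 2), (1, 8)]))  # (27, 8)
```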

We now construct a TC0 circuit that simulates

a saturated transformer over floats.

Theorem 5. AHAT(F) ⊆ TC0.

Proof. For each n, we construct a TC0 circuit
that simulates a saturated transformer on inputs of
size n. We construct the circuit modularly, with
one subcircuit for the attention mechanism, and
another for the feedforward subnetwork.

Attention Head  Fix a single head in some layer.
We will construct a TC0 subcircuit that simulates
the attention mechanism at position i. The head
attends over vectors v1, · · · , vn. For all j, vj has
size O(log n) by Theorem 4. In parallel for each
j, we compute the scores ai,j = s(vi, vj) with
an AC0 circuit by Corollary 2.1. We then com-
pute ai,max = maxj ai,j with an AC0 circuit by
comparing all ai,j pairwise, and selecting the first
ai,k such that ai,k ≥ ai,j for all j. We then com-
pute ''masked'' values ui,j for each j via an AC0
circuit by Lemma 2:

  ui,j = vj if ai,j ≥ ai,max, and 0 otherwise.

We then compute the sum si = Σ_{j=1}^{n} ui,j by
Lemma 3. By Lemma 1, si has size O(log n).
Now, we similarly define

  zi,j = 1 if ai,j ≥ ai,max, and 0 otherwise.

Using an analogous sum construction with zi,j
instead of ui,j, we can use a TC0 circuit to com-
pute |M(a)|: the number of j such that ai,j ≥
ai,max. Finally, since dividing floats is in TC0
(cf. Appendix A), we can compute the head output
as si/|M(a)|, which has size O(log n) by size
preservation of division.
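The following Python sketch (our own illustration with exact Fractions; it computes the quantity the subcircuit outputs rather than building the circuit itself) summarizes the attention-head simulation:

```python
# Sketch of what the attention subcircuit computes: saturated attention
# averages the values at the positions whose score ties the maximum.
from fractions import Fraction

def saturated_head(values, scores):
    """values: v_1..v_n; scores: a_{i,1}..a_{i,n} for a fixed query position i."""
    a_max = max(scores)
    masked = [v if a >= a_max else Fraction(0)          # u_{i,j}
              for v, a in zip(values, scores)]
    tie_count = sum(1 for a in scores if a >= a_max)    # |M(a)|, via the z_{i,j} indicators
    return sum(masked, Fraction(0)) / tie_count         # s_i / |M(a)|

values = [Fraction(1), Fraction(2), Fraction(4)]
scores = [3, 7, 7]
print(saturated_head(values, scores))  # 3: average of the two tied positions
```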

Feedforward As input, f receives vi as well as
H head outputs, all of which have size O(log n).
As the total size of the input is O(log n), we can
use Corollary 2.1 to compute the output of f with
an AC0 circuit. The size of the output is O(log n)
by size preservation of f . The same idea holds for
φ as well as the linear classification head.

We have simulated each transformer com-
ponent with a TC0 subcircuit, completing the
proof.

8.1 Discussion

Recall that, over rationals, we found that size-
preserving saturated transformers could recognize
any language. In contrast, we have now shown
that using floating-point representations places
such transformers within TC0. In this paper, we
have only considered non-uniform AC0 and TC0,
as opposed to the uniform variants of these classes,
which are more closely connected to familiar for-
mal language classes like the regular and context-
free languages (cf. Cojocaru, 2016; Mahajan,
2007). As transformers satisfy some intuitive no-
tion of uniformity, an open question is whether
saturated transformers also fall into uniform TC0.

9 Conclusion

Compared with hard attention, saturated atten-
tion adds theoretical power to transformers. We
showed that saturated attention lets transformers
recognize languages outside AC0, which is the
upper bound with hard attention. Further, while
saturated transformers with rational values and
size-preserving internal functions can recognize
any language, we characterize the limits of size-
preserving saturated transformers with floats. Spe-
cifically, saturated transformers with float values
fall in TC0, a more powerful circuit class than AC0.
Thus, going from hard to saturated attention can be
understood as augmenting the model with thresh-
old gates. This illustrates one way that the circuit
complexity paradigm characterizes the power of
transformers. Going forward, there are many in-
teresting open questions that circuit analysis can
answer, such as comparing the power of satu-
rated and soft attention, and refining existing up-
per bounds for transformers in terms of uniform
circuit families.

Acknowledgments

Thanks to Yiding Hao, Dana Angluin, and Robert
Frank for sharing an early draft of their work.
We also appreciate helpful feedback from Dana
Angluin, Matt Gardner, Yoav Goldberg, Michael
Hahn, Kyle Richardson, and Roy Schwartz.

Appendix A Float Division

Let / be truncated division between integers. We
divide a float by an integer p by defining an
approximate multiplicative inverse p^{−1}. The nu-
merator is 2^{|p|}/p and the denominator is 2^{|p|}. For
division by a float p, q, we simply apply the inte-
ger approach and then multiply by q. This yields
numerator (2^{|p|}/p) · q and denominator 2^{|p|}.

The fact that float division is defined in terms
of integer multiplication and division implies that
it is size-preserving and can be simulated in TC0,
which we use in §8.
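A minimal Python sketch of this approximate inverse (helper names are ours; Python's bit_length stands in for the size |p|, and // is truncated division):

```python
# Sketch of the approximate multiplicative inverse defined above.
def approx_divide_by_int(num, den, p):
    """Divide the float (num, den) by a positive integer p."""
    shift = 1 << p.bit_length()             # 2^{|p|}
    return num * (shift // p), den * shift  # multiply by p^{-1} ~ (2^{|p|}/p) / 2^{|p|}

def approx_divide_by_float(num, den, p, q):
    """Divide (num, den) by the float (p, q): integer rule for p, then multiply by q."""
    n2, d2 = approx_divide_by_int(num, den, p)
    return n2 * q, d2

# (3, 4) / 3: with |3| = 2, p^{-1} ~ (4 // 3)/4 = 1/4, so we get 3/16 = 0.1875
# rather than the exact 0.25 -- the truncation error of the approximate inverse.
print(approx_divide_by_float(3, 4, 3, 1))   # (3, 16)
```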

Appendix B Justifying Size Preservation

We justify that feedforward neural networks are
size-preserving over floats. Feedforward neural
networks are made up of a fixed (with respect
to n) number of addition, multiplication, division,
ReLU, and square root (for layer norm) operations.
Therefore, it suffices to show that these operations
are all in S(F).

For addition, the numerator is

≤ p1q2 + p2q1 ≤ 2pmaxqmax,
which has size ≤ log 2 + |pmax| + |qmax| ≤
2(|pmax| + |qmax|) for large enough input size.

For multiplication, the numerator is just p1 · p2,
which has size ≤ 2|pmax|. Let the denominators be
q1 = 2k1 and q2 = 2k2. Then the denominator is
2k1+k2, which has size ≤ 2|qmax|.

Division can be analyzed in terms of the approx-
imate multiplicative inverse (Appendix A).9 Its
numerator has size ≤ |p| + 1 + |q| ≤ 2(|p| + |q|)
for large enough input size. The denominator has
size ≤ |p| + 1 ≤ 2|p| for large enough input size.

9The exact multiplicative inverse ⟨p, q⟩ ↦ ⟨q, p⟩ over
unconstrained rationals is also size-preserving. Thus, neural
networks are size-preserving over both floats and rationals.

Size preservation is trivially satisfied for ReLU,

which cannot expand the size of the input.

To make layer norm work, we just need to ana-
lyze square root, which we will define in a trun-
cated fashion over integers. The square root of a
rational, then, simply takes the square root of p
and q. We have that |√p| ≤ |p| and analogously
for q.

Appendix C Resource-Bounded Transformers

Size preservation is one way to characterize the
constraints on transformers' internal functions; a
slightly different perspective is to fix φ and ana-
lyze how the language recognition abilities of the
transformer change depending on the computa-
tional resources allotted to each sℓ,h and fℓ. In this
section, we derive an alternate universality theo-
rem in terms of time complexity classes. We will
show that as long as φ is powerful enough, such
transformers have equivalent time complexity to
their activation functions.

Recall that a transformer is a tuple ⟨Σ, D, α,
L, H, φ, sℓ,h, fℓ⟩. In contrast to AHAT(D) (cf.
Definition 6), we will now work with a different
class of transformer languages AHAT(D, T (m)).
We will allow the embedding functions to be lin-
ear in the sequence length, and explore the effect
of varying the complexity of the other internal
functions. Let FTIME(T (m)) be the set of func-
tions computable by a Turing machine in T (m)
time.10

Definition 10. Let AHAT(D, T (m)) be the class
of languages L ⊆ Σ∗ such that there exists a trans-
former ⟨Σ, D, α, L, H, φ, sℓ,h, fℓ⟩ that recognizes
L, where φ runs in time linear in the sequence
length n, and sℓ,h, fℓ ∈ FTIME(T (m)).

For any T (M) (cid:2) M, we will show transformers
AHAT(D, T (M)) have the complexity of their
activation functions. Formally:

Theorem 2
D ∈ {F, Q} and T (M) (cid:2) M,

(Formal version of Theorem 2). Für

Each of these components is computable in time
linear in n. Define three heads b1,i, b2,i, b3,i. Mit-
out loss of generality, consider bh,ich
to act on
Phi(wi, ich)h alone, rather than the full embeding
vector. b1,i is defined as a uniform head, while b2,i
and b3,i are computed with sh(vi, vj) = vj. Daher,

b1,i =

1
N

(cid:2)

wj =1

2j−1

b2,i = n
b3,i = 2|N|.

Endlich, we discuss how to set f to compute
whether w ∈ L. Let p be the function that extracts
the numerator of a float or rational number, welche
is computable in O(M) time on float of size m.
Within f , we compute u = p(b1,i). At this point,
we proceed in two cases depending on the data-
type D:

1. Rationals: If D = Q, then u is the binary
string w. Any L ∈ TIME(T (M)) has an in-
dicator function δ ∈ FTIME(T (M)), welche
we now apply to recognize whether w ∈ L.
2. Floats: If D = F, then u = 2|N|/n · w as in
§9. daher, in linear time, we compute

b2,i
b3,i

· u =

N
2|N|

· 2|N|w
N

= w,

AHAT(D, T (M)) ⊆ TIME(T (M)).

and feed w through δ as in the D = Q case.

Proof. Erste, observe that AHAT(D, T (M))
TIME(T (M)), since the embedding function and
saturated attention can be computed in time linear
in the input sequence length, and the other internal
functions can be computed in FTIME(T (M)) von
construction.

We now show TIME(M) ⊆ AHAT(D, T (M)).
We adopt a 1-layer transformer construction, Und
thus omit (cid:6), h subscripts. We define three compo-
nents of the embedding function φ : Σ × N → D3:

Phi(wi, ich)1 =

(cid:3)

2i−1
0

if wi = 1
ansonsten

Phi(wi, ich)2 = i
Phi(wi, ich)3 = 2|ich|.

10We write FTIME(M) instead of

the conventional

FTIME(N) to avoid confusion with the sequence length n.

Also, TIME(T (M)) ⊆ AHAT(D, T (M)).

Appendix D Proof from Hao et al. (2022)

The proof for Lemma 2 largely follows the proof
of a core lemma of Hao et al. (2022). We re-
produce a slightly adapted version of their proof
here, because their manuscript is not yet pub-
licly available, and we wish for our paper to be
self-contained.

Lemma 2. Any function f : {0, 1}^c → {0, 1}^d
can be computed by a Boolean circuit of depth 3
and size at most d(2^c + c + 1).

Proof. The idea of the proof is to define d subcir-
cuits of size at most 2c + C + 1 that compute the d
output bits of f in parallel. We will build a circuit
that computes each output bit of f according to its
representation in disjunctive normal form (DNF).

We define a first layer of the circuit that computes
the negation of each input, which takes c gates.
The second layer then computes the value of each
DNF term by computing a conjunction (∧ gate)
over the corresponding literals or negated literals.
Note that a formula of c variables has at most 2^c
DNF terms. Finally, the third layer of the circuit
computes a disjunction (∨ gate) over the values
of all terms, yielding the output of f, and adding
a single gate. In summary, we have shown how to
compute each output bit with a circuit of size at
most 2^c + c + 1, which implies the full function
f can be computed by a circuit of size at most
d(2^c + c + 1).

References

Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupała, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad, editors. 2020. Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Online.

Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach. Cambridge University Press. https://doi.org/10.1017/CBO9780511804090

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.

Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020. On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7096–7116, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.576

Liliana Cojocaru. 2016. Advanced Studies on the Complexity of Formal Languages. Ph.D. thesis, University of Tampere.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Peter van Emde Boas. 1991. Machine Models and Simulations, chapter 1. MIT Press, Cambridge, MA, USA. https://doi.org/10.1016/B978-0-444-88071-0.50006-0

Merrick Furst, James B. Saxe, and Michael Sipser. 1981. Parity, circuits, and the polynomial-time hierarchy. In Proceedings of the 22nd Annual Symposium on Foundations of Computer Science, SFCS '81, pages 260–270, USA. IEEE Computer Society. https://doi.org/10.1109/SFCS.1981.35

Larry J. Goldstein. 1973. A history of the prime number theorem. The American Mathematical Monthly, 80(6):599–615. https://doi.org/10.1080/00029890.1973.11993338

Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171. https://doi.org/10.1162/tacl_a_00306

Yiding Hao, Dana Angluin, and Robert Frank. 2022. Formal language recognition by hard attention transformers: Perspectives from circuit complexity.

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.156

David S. Johnson. 1991. A Catalog of Complexity Classes, chapter 2. MIT Press, Cambridge, MA, USA. https://doi.org/10.1016/B978-0-444-88071-0.50007-2

Neeraj Kayal. 2015. Lecture notes for topics in complexity theory.

Meena Mahajan. 2007. Polynomial size log depth circuits: Between NC1 and AC1. Bulletin of the EATCS, 91:42–56.

William Merrill. 2019. Sequential neural networks as automata. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-3901

William Merrill. 2021. Formal language theory meets modern NLP. ArXiv, abs/2102.10094.

William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1766–1781, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.133

Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. Rational recurrences. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1203–1214, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1152

Jorge Pérez, Javier Marinkovic, and Pablo Barceló. 2019. On the Turing completeness of modern neural network architectures. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Volume 30. Curran Associates, Inc.

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2117

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2021. Thinking like transformers. ArXiv, abs/2106.06981.

Andrew C.-C. Yao. 1990. On ACC and threshold circuits. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science, pages 619–627, vol. 2.

Shunyu Yao, Binghui Peng, Christos Papadimitriou, and Karthik Narasimhan. 2021. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3770–3785, Online. Association for Computational Linguistics.

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 2020. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations.