Formal Language Recognition by Hard Attention Transformers:
Perspectives from Circuit Complexity
Yiding Hao, Dana Angluin, and Robert Frank
Yale University
新天堂, CT, 美国
{yiding.hao, robert.frank, dana.angluin}@yale.edu
抽象的
This paper analyzes three formal models of
Transformer encoders that differ in the form
of their self-attention mechanism: unique hard
注意力 (UHAT); generalized unique hard at-
注意力 (GUHAT), which generalizes UHAT;
and averaging hard attention (AHAT). 我们
show that UHAT and GUHAT Transformers,
viewed as string acceptors, can only recog-
nize formal languages in the complexity class
AC0, the class of languages recognizable by
families of Boolean circuits of constant depth
and polynomial size. This upper bound sub-
sumes Hahn’s (2020) results that GUHAT
cannot recognize the DYCK languages or the
PARITY language, since those languages are
outside AC0 (Furst et al., 1984). 相比之下, 这
non-AC0 languages MAJORITY and DYCK-1
are recognizable by AHAT networks, imply-
ing that AHAT can recognize languages that
UHAT and GUHAT cannot.
1
介绍
The Transformer architecture for neural networks
(Vaswani et al., 2017) has yielded remarkable
advances in performance on a variety of bench-
mark tasks in natural language processing. 这些
advances have spurred considerable interest in
understanding the capabilities and limitations of
the Transformer architecture. While Transformer
networks are extremely complex when deployed
at scale, theoretical studies such as those of P´erez
等人. (2019), Yun et al. (2020), Hahn (2020),
and Merrill et al. (2022) have uncovered mean-
ingful
the expressive power of
Transformers by formulating abstract models of
the self-attention mechanism and analyzing their
computational power.
insights about
在这项工作中, we analyze three restricted models
of self-attention based on their ability to recog-
nize formal languages. All three models use hard
attention—meaning that each attention head at-
tends only to the position or positions with the
800
highest attention score, with no attention paid to
any of the other positions—but differ in how they
behave in the case of ties in the maximum at-
tention value. In the first two models we study,
the attention mechanism returns the value at ex-
actly one position (例如, the leftmost) 在
case several positions tie for the maximum at-
tention value. The first such model, 我们
call generalized unique hard attention Transform-
呃 (GUHAT) and was defined by Hahn (2020),
imposes no restrictions on the nature of activa-
tion values or the functions the network uses to
compute them. The second model, unique hard
attention Transformers (UHAT), was defined and
studied by Yao et al. (2021) and is a more concrete
version of GUHAT that incorporates restrictions
on the nature of activation values and computa-
系统蒸发散. In the third model, which we call averaging
hard attention Transformers (AHAT), the atten-
tion mechanism returns the uniform average of
the values at positions with the maximum atten-
tion value. This is the definition of hard attention
used by P´erez et al. (2019), Yun et al. (2020), 和
Merrill et al. (2022).1
Our main contribution is to prove that GUHAT
and UHAT can only recognize formal languages
in AC0, the class of formal languages recognized
by a family of Boolean circuits of constant depth
and polynomial size, whereas AHAT can rec-
ognize formal languages outside of AC0. 更多的
formally, we prove that any formal language rec-
ognized using a GUHAT is also recognized by
a family of Boolean circuits of constant depth
and polynomial size, establishing AC0 as an up-
per bound on the expressiveness of UHAT and
GUHAT. We also show that every UHAT can
be simulated by an AHAT, establishing UHAT
as a subclass of AHAT. Based on the classical
results of Furst et al. (1984), our upper bound sub-
sumes Hahn’s (2020) results that GUHAT cannot
1Merrill et al. (2022) call it saturated hard attention.
计算语言学协会会刊, 卷. 10, PP. 800–810, 2022. https://doi.org/10.1162/tacl 00490
动作编辑器: Carlos G´omez-Rodr´ıguez. 提交批次: 1/2022; 修改批次: 4/2022; 已发表 8/2022.
C(西德:2) 2022 计算语言学协会. 根据 CC-BY 分发 4.0 执照.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
我
A
C
_
A
_
0
0
4
9
0
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
recognize the DYCK languages or the PARITY
语言, neither of which belongs to AC0. 毛皮-
瑟莫雷, our result combines with P´erez et al.’s
(2019) AHAT implementation of the MAJORITY
language and Bhattamishra et al.’s (2020) AHAT
implementation of DYCK-1 (neither of which is
in AC0) to show that AHAT can recognize lan-
guages that GUHAT cannot. 最近, Merrill
等人. (2022) have given an upper bound on the
power of AHAT: 即, that every formal lan-
guage recognizable using averaging hard attention
is recognizable using a family of circuits of con-
stant depth and polynomial size with Boolean and
majority gates; 那是, a family of circuits in the
complexity class TC0, known to be a strict superset
of AC0. 合在一起, our paper establishes the
following relationships between the three models
we consider of hard-attention Transformers.
UHAT ⊆ GUHAT ⊆ AC0
UHAT (西德:2) AHAT (西德:3) AC0
2 Preliminaries
Let Σ be a fixed finite alphabet of symbols, 和
让 $ be a distinct end-of-sequence symbol not in Σ. The set of strings of symbols over Σ of length n is denoted by Σn, and the set of all finite strings over Σ is denoted Σ∗. A (formal) language over Σ is any subset of Σ∗. The set of integers between i and j inclusive is denoted [i..j]. Define the function (西德:2) : N → N as (西德:2)(n) = (西德:6)log2(n + 1)(西德:7) for all n, and define bin(我, n) to be the binary representation of i as a string of length (西德:2)(n), for every i ∈ [1..n]. 例如, bin(6, 30) = 00110. If P is a logical predicate, 然后 {磷 } denotes the truth value of P ; 例如, {(x ≥ y) ∨ (z > 3)} = 1 when x ≥ y or z > 3, 并且是 0 否则. 3 Circuit Complexity Our analysis of UHAT and GUHAT is carried out within the framework of circuit complexity, in which the complexity of a computational system is measured by the size, 深度, and types of gates of a Boolean circuit implementing that system. In this section we review the basic concepts, definitions, and results of circuit complexity used by our analysis. A detailed overview is provided in Chapter 6 of Arora and Barak (2009). 3.1 Boolean Circuits on logic systems Boolean circuits are a formal model of com- putational gates. based Roughly speaking, a Boolean circuit consists of binary-valued input and output layers, with feedforward connections2 to one another via intermediate gates that implement logical opera- 系统蒸发散. We use the following definition of Boolean circuits. Definition 1. A Boolean circuit with n inputs and m outputs is a labeled directed acyclic graph satisfying the following conditions. There are n distinguished input vertices labeled with the vari- ables x1, x2, . . . , xn. Each input vertex has fan-in 0. The rest of the vertices are gates, each hav- ing a label from Constant-0, Constant-1, NOT, AND, or OR. The Constant-0 and Constant-1 gates have fan-in 0, NOT gates have fan-in 1, and AND and OR gates have unbounded fan-in. 最后, the labels z1, z2, . . . , zm are applied to some (not necessarily distinct) vertices; these are the outputs. We refer to the edges of a Boolean circuit as wires. The size of a circuit is the number of wires it contains, and the depth of a circuit is the maximum length of a directed path of wires from an input vertex to an output. A Boolean circuit C computes a Boolean function from {0, 1}n to {0, 1}米; we denote its output on input x = x1x2 . . . xn by C(X). Observe that a Boolean circuit has a fixed number of input vertices, and therefore can only take as input bit strings of a fixed length. We would like to define circuit computation for a map defined on all of {0, 1}∗. To that end, we allow different circuits for inputs of different lengths. Definition 2. A family of circuits is a sequence {中文}, where for each integer n ≥ 0, Cn is a Boolean circuit with n inputs and one output. A map f from {0, 1}∗ to {0, 1} is computed by a family of circuits {中文} if and only if for all n and all x ∈ {0, 1}n, F (X) = Cn(X). The class AC0 is defined by setting restrictions on the size and depth of circuits within a family of Boolean circuits. Definition 3. A family of circuits is of constant depth if there exists a constant K such that the depth of Cn is bounded by K for all n. A family 2We consider only acyclic circuits. 801 l 从http下载 : / / 直接的 . 米特 . 呃呃 / t a c l / 拉蒂斯 – df / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 0 2 0 3 7 1 2 4 / / t l a c _ a _ 0 0 4 9 0 压力 . 来宾来访 0 7 九月 2 0 2 3 of circuits is of polynomial size if there exists a constant c such that the size of Cn is bounded by nc + c for all n. The set AC0 is the set of families of Boolean circuits of both constant depth and polynomial size. We relate formal languages with families of circuits by identifying languages L ⊆ Σ∗ with Boolean functions that classify strings as belong- ing to L or not. Formally speaking, let Σ be a finite alphabet. A binary symbol encoding of Σ is an injective map h from Σ to {0, 1}s, where s = (西德:2)(|Σ|). Thus h maps each symbol to a dis- tinct binary string of length s. We extend h to a homomorphism on strings from Σ∗. We say that the circuit family {中文} recognizes the language L over Σ if there is a binary symbol encoding h of Σ such that for every n and every x ∈ Σn, Csn(H(X)) = 1 if and only if x ∈ L. With this definition of language recognition via Boolean circuits, we say that a language is in AC0 if and only if it is recognized by a family of Boolean circuits in AC0. 3.2 Non-AC0 Languages Having defined the class AC0, we present some examples of languages not belonging to this class. 第一的, the following three languages were shown by Furst et al. (1984) to fall outside AC0. Definition 4. We define the following languages over the alphabet {0, 1}. The language PARITY is the set of all strings containing an even number of 1s; MAJORITY is the set of strings with at least as many 1s as 0s; and EQUALITY is the set of strings with exactly as many 1s as 0s. 此外, we show later in this paper (Corollary 3) that DYCK-1 also falls outside AC0. Definition 5. The language DYCK-k is the set of strings over an alphabet of k types of pairs of brackets that are correctly nested and matched. 例如, DYCK-2 over the alphabet {(, ), [, ]} can be described by a context free grammar with productions S → ε, S → (S), S → [S], and S → SS. The language DYCK-(k, D) is the set of strings in DYCK-k in which the depth of nest- ing of brackets never exceeds D. The language SHUFFLE-k is the shuffle (arbitrary interleaving) of strings from k versions of DYCK-1 each using a different type of bracket pair. 最后, we define PALINDROMES, a language shown in Section 5 to be in GUHAT. 802 Definition 6. The language PALINDROMES is the set of strings equal to their reverses, which can be described by the context free grammar with productions S → ε, S → σ, and S → σSσ for each alphabet symbol σ. 4 Hard Attention Transformers We now define the three kinds of hard atten- tion Transformers studied in this paper: GUHAT, UHAT, and AHAT. These formalisms are models of computation inspired by the encoder portion of the Transformer architecture. They conceptu- alize Transformers as cascading layers of feature extractors that convert a sequence of embeddings into increasingly higher-level representations. 4.1 General Framework We begin by presenting a general framework that subsumes the three hard attention Transformer models. 正式地, a generalized Transformer is a device that maps a string x ∈ Σ∗$ to 1 或者 0, signi-
fying that x is accepted or rejected, 分别.
Each such device is parameterized by a collection
of functions described as follows.
Definition 7. A generalized Transformer with K
layers and H attention heads is a tuple T =
(Σ, A, F, f att
k , G | k ∈ [1..K], h ∈
[1..H]) 在哪里
k,H, f pool, f act
• Σ is the input alphabet,
• A is the set of activation values,
• f : Σ ∪ {$} × N × N → A is the input function, k,H : A × A → R is the attention function for layer k and head h, • f att • f pool : A∗ × R∗ → A is the pooling function, • f act : AH+1 → A is the activation function k for layer k, and • g : A → {0, 1} is the model output function. On input x1x2 . . . xn where xn = $, a string
y(0)
1 y(0)
n ∈ An of initial activation values
2
is given by
. . . y(0)
y(0)
i = f (希, 我, n)
for all i. Each layer k then produces a string
y(k) = y(k)
n ∈ An of activation val-
ues from the previous activation values y(k−1) =
1 y(k)
. . . y(k)
2
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
我
A
C
_
A
_
0
0
4
9
0
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
2
y(k−1)
. . . y(k−1)
n
y(k−1)
as follows. 第一的, each atten-
1
tion head h produces an n × n matrix of attention
scores ai,j,k,h given by
人工智能,j,k,h = f att
k,H
(西德:2)
y(k−1)
我
, y(k−1)
j
(西德:3)
for all positions i, j of the input string. 下一个, 这
pooling function converts each row of attention
scores into an activation value based on y(k−1):
(西德:2)
双,k,h = f pool
y(k−1)
1
, y(k−1)
2
, . . . , y(k−1)
n
,
(西德:3)
人工智能,1,k,H, 人工智能,2,k,H, . . . , 人工智能,n,k,H
.
最后, the layer output y(k) is computed using
the layer’s activation function:
y(k)
i = f act
k
(西德:2)
y(k−1)
我
(西德:3)
, 双,k,1, . . . , 双,k,H
.
When y(k) has been computed for all k ∈ [1..K],
the final output of the generalized Transformer
时间 (X) is computed by applying the model output
function to the last symbol of y(K); 那是,
时间 (X) = g
(西德:2)
(西德:3)
y(K)
n
.
If T (X) = 1, we say that T accepts x; 否则,
we say that T rejects x. The language recognized
by T , denoted L(时间 ), is the set of strings x ∈ Σ∗
such that T accepts x$. The formalism we have presented above is fully generalized in the sense that we have placed no restrictions on the activation values A or the functions f , f att k , or g, other than to specify their domains and co-domains. The three hard attention Transformer models are derived by placing restrictions upon these elements. k,H, f pool, f act 4.2 Unique and Averaging Hard Attention The first restriction we consider is on the form of the pooling function. We consider two types of pooling functions: the unique hard attention function, used in GUHAT and UHAT, and the averaging hard attention function, used in AHAT. In unique hard attention, the pooling function simply selects the activation value from the previ- ous layer corresponding to the argmax of the row of attention scores. In case of a tie, the leftmost activation value is selected. Definition 8. The unique hard attention function is the pooling function f UHA : A∗ × R∗ → A defined as follows. On inputs (y1, y2, . . . , yn) ∈ 803 An and (a1, a2, . . . , 一个) ∈ Rn, let j ∈ [1..n] be the least position that maximizes aj. 然后, f UHA(y1, . . . , yn, a1, . . . , 一个) = yj. Averaging hard attention is similar to unique hard attention, except that in the case of a tie, the selected activation values are averaged. Definition 9. Let A be a vector space over a field containing Q. The averaging hard attention function is the pooling function f AHA : A∗×R∗ → A defined as follows. On inputs (y1, y2, . . . , yn) ∈ An and (a1, a2, . . . , 一个) ∈ Rn, let j1, j2, . . . , jm ∈ [1..n] be all the positions that maximize aj. 然后, f AHA(y1, . . . , yn, a1, . . . , 一个) = 1 m m(西德:4) i=1 yji. The GUHAT model is defined as the class of generalized Transformers that use unique hard attention. Definition 10. A generalized unique hard atten- tion Transformer is a generalized Transformer whose pooling function is f UHA. We use the term GUHAT to refer to the class of generalized unique hard attention Transformers, and also to the class of languages they recognize. k The GUHAT model mostly follows the defi- nitions of Hahn (2020). It is slightly generalized in allowing the input function f to depend on the input length n, and in allowing the activa- tion function f act to depend on the layer k, but these generalizations are immaterial. 尤其- 拉尔, it is not necessary to assume that the input length n is provided to the input function: if the input function were f (σ, 我) = (σ, 我), the subse- quent layer could direct attention at every position to position n (because it uniquely contains the end-of-sequence symbol $), at which point the
value of n is available at every position.
k,H, f act
4.3 Restricted Models: UHAT and AHAT
GUHAT allows the activation values A and the
functions f , f att
k , and g to take on any
arbitrary mathematical value. In practical appli-
cations of Transformer networks, 然而, 这些
components are restricted in specific ways. 许多
variations of hard attention Transformers attempt
to incorporate these restrictions into theoretical
型号, though they do not entirely agree on the
details of these restrictions.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
我
A
C
_
A
_
0
0
4
9
0
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
The UHAT and AHAT models adopt many
of these restrictions, largely following the def-
initions of Yao et al. (2021). For the sake of
computability, we require activation values to
be vectors of rational numbers. Following P´erez
等人. (2019), we restrict scalars to be rational as
出色地. 下一个, we assume that the input function f
is decomposed into a token embedding function
and a position embedding function. Mirroring the
more familiar description of attention functions
in terms of query, 钥匙, and value matrices, 我们
use a bilinear form for attention functions pro-
posed in Luong et al. (2015). In addition to the
unique and averaging hard attention mechanisms,
we allow the pooling function to be future-masked
(where for position i only those positions j with
j ≤ i are considered in the attention compu-
站) or past-masked (similarly for j ≥ i).
最后, we assume that activation functions and
the model output function are computed by feed-
forward neural networks with ReLU activation.
These restrictions are summarized below.
Definition 11. For d ∈ N, a restricted Trans-
former of dimension d is a generalized Trans-
former such that
• the set of activation values is A = Qd;
• the input function is given by
F (σ, 我, n) = fe(σ) + p(我, n),
where fe : Σ ∪ {$} → Qd is the token embedding function and p : N × N → Qd is the position embedding function; • each attention function is of the form k,H(y, y(西德:13)) = y(西德:14)和,hy(西德:13), f att where Ak,h ∈ Qd×d; • the pooling function may be future-masked or past-masked; • each activation function is computed by a feedforward neural network with ReLU activation; • the output function g is computed by a feed- forward neural network with ReLU activation followed by a softmax layer, with g(y) = 1 if and only if the output of the network on input y is greater than or equal to 1/2. Because Σ is finite, we may assume that the token embedding function is given by a table lookup. Our formulation of position embedding is somewhat more general than the definition of Yao et al. (2021), who take the position embedding to be a scalar defined as p(我, n) = i/n that occupies one position of the initial activation vector. The UHAT and AHAT models are defined to be restricted Transformers that satisfy the above conditions and use unique and averaging hard attention, 分别. Definition 12. A unique hard attention Trans- former is a restricted Transformer whose pooling function is f UHA or a future- or past-masked version thereof. An averaging hard atten- tion Transformer is a restricted Transformer whose pooling function is f AHA or a future-or past-masked version thereof. We use the terms UHAT and AHAT, 分别, for these classes of Transformers, and also for the classes of languages they recognize. UHAT is clearly a subclass of GUHAT because the former imposes restrictions on the form of the input, 注意力, activation, and output functions. We suspect, but do not prove, that this inclusion is proper. 而且, we briefly argue below that UHAT is properly contained in AHAT. Proposition 1. UHAT is a strict subclass of AHAT. Proof sketch. Because AHAT recognizes non- AC0 languages (P´erez et al., 2019; Bhattamishra et al., 2020), it suffices to show that UHAT ⊆ AHAT. Let T be a UHAT of dimension d rec- ognizing L. We define a UHAT ˆT of dimension d + 2 recognizing L that has no ties in its at- tention values. Since the pooling functions f UHA used in UHAT and f AHA used in AHAT are identical in the absence of ties, replacing the pool- ing function of ˆT with f AHA gives us an AHAT recognizing L. i Let N be a sufficiently large integer depending on n, specified below. Each activation value ˆy(k) in ˆT is y(k) from T with two additional constant components, set to 1 and i/N by the input func- , ˆy(k−1) 的. The attention function ˆf att ) j computes ai,j,k,h using the original attention func- tion and activation values, subtracting the value j/N . This is achievable with a bilinear map. k,H(ˆy(k−1) i i If for j < (cid:2) the attention values ai,j,k,h and ai,(cid:2),k,h are tied in T , then after subtracting j/N 804 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 0 2 0 3 7 1 2 4 / / t l a c _ a _ 0 0 4 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 1: Activation values computed by a GUHAT Transformer for PALINDROMES as it rejects the input abcca$.
and (cid:2)/N respectively, the tie is broken in favor
of j. However, N must also be large enough to
preserve the order of any two attention values that
are not tied. There are a finite number of different
attention values that arise in the computation of
T on all the inputs of length n, and it suffices to
choose N so that n/N is less than the distance
between any pair of such attention values.
4.4 Prior Results for These Models
Hahn (2020) shows that the languages 1∗ and
{anbn | n ≥ 1} are in GUHAT, and the lan-
guages PARITY and DYCK-k for all k ≥ 1 are
not in GUHAT. P´erez et al. (2019) show that
even without positional information, the language
MAJORITY is in AHAT. Bhattamishra et al.
(2020) show that SHUFFLE-k is in AHAT, which
implies that DYCK-1 is in AHAT. Yao et al. (2021)
show that the language DYCK-(k, D) is in UHAT.
The latter two results use positional masking, but
no other positional information.
5 PALINDROMES in GUHAT
Let us now illustrate how a GUHAT computes
by way of example. In this section, we describe a
GUHAT with 2 layers and 1 head that recognizes
the language PALINDROMES over the alphabet
Σ = {a, b, c}. Broadly speaking, this Transformer
works as follows. The first layer is responsible for
comparing each symbol of the input string with
the corresponding symbol on the opposite side of
the string, and marking whether the two symbols
match. The second layer reads these markings,
searching for a mismatch identified by the first
layer. If one is found, the model output function
returns 0; otherwise, it returns 1. For intuition,
we simultaneously illustrate the Transformer’s
computation on the input abcca$, which should be rejected. The input function is defined as f (σ, i, n) = (σ, i, n) for each σ ∈ Σ and i ∈ [1..n]. For our example input, the initial (layer 0) activation values are shown in the first row of Figure 1. These activation values are not rational-valued vectors, of course, but the GUHAT model imposes no restriction on the form these values can take. We define the attention function for layer 1, 1,1 ((σ, i, n), (σ(cid:13), j, n)), to be {(j = n − i) ∨ (i = f att j = n)}. For each position i < n, this selects the activation at the correct corresponding position, n − i. For position n, it selects the activation at position n. We define the activation function for layer 1 as f act 1 ((xi, i, n), (xj, j, n)) = ⎧ ⎪⎨ ⎪⎩ (1, i), xi (cid:15)= xj (1, i), (0, i), i = j = n otherwise. The layer 1 activation values for our example input are shown in the second row of Figure 1. This indicates that positions 2 and 4 found mismatched symbols, and positions 1, 3, and 5 did not. Layer 2 gathers the relevant information from layer 1 into the last position. The layer 2 attention function is defined by f att 2,1 ((r, i), (s, j)) = s. This directs the attention at every position to the left- most activation value (s, j) from layer 1 such that s = 1. In our example, the leftmost such position is 2, with its activation of (1, 2). If the input se- quence had instead been a valid palindrome, none of the positions i ∈ [1..n − 1] would have been marked with (1, i) by layer 1. In this case, the leftmost position with s = 1 would have been the final position n, which has the activation value of (1, n). We define the layer 2 activation function as f act 2 ((r, i), (s, j)) = (i, j). For our example input, the activation values for layer 2 are shown in the third row of Figure 1. The activation value at position n will be (n, n) if and only if no earlier position found a symbol mismatch, so the model output function is simply g((i, j)) = {i = j}. With the input sequence abcca$, the activation for
position 6 at layer 2 is (6, 2) and the output value
is 0. For a valid palindrome such as abcba$, the 805 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 0 2 0 3 7 1 2 4 / / t l a c _ a _ 0 0 4 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 activation for position 6 at layer 2 is (6, 6) and the output value is 1. Generalizing this construction to an arbitrary alphabet Σ, we have the following. Proposition 2. For any finite alphabet Σ, the language PALINDROMES over Σ is in GUHAT. 6 A Normal Form for GUHAT Despite the abstractness and generality of the GUHAT model, we can define a normal form representation and show that every Transformer T in GUHAT is equivalent to a Transformer in GUHAT in this normal form with the same number of layers and heads. The key idea is to preserve in the activation values all the information from previous layers that has been used to compute them, by requiring that the input and activation functions just return the tuple of their arguments. We also require that attention values be integers in the smallest relevant range. Definition 13. A GUHAT with K layers and H heads is in informative normal form if and only if the following conditions are satisfied. • The input function is f (σ, i, n) = (σ, i, n). • For each layer k ∈ [1..K], the activation values are (H +1)-tuples of activation values at layer k − 1, and the activation function is defined by f act k (y, b1, . . . , bH ) = (y, b1, . . . , bH ). • For each layer k ∈ [1..K] and attention head h ∈ [1..H], the attention function f att k,h returns an integer in [0..N − 1], where N is the total number of possible ordered pairs of activation values at layer k − 1. Lemma 1. For any Transformer T ∈ GUHAT, there exists a Transformer ˆT ∈ GUHAT in in- formative normal form such that L(T ) = L( ˆT ). Moreover, ˆT has the same number of layers and heads as T . Proof. Let T be a GUHAT with K layers and H heads, with input alphabet Σ, input function f , attention functions f att k,h, activation functions f act k , and output function g. We describe how to construct functions for an equivalent Transformer ˆT in GUHAT in informative normal form, which also has K layers and H heads. We assume that n is the input length. For ˆT the input function ˆf (σ, i, n) is defined to return the triple (σ, i, n). Note that there are at most |Σ|n possible initial activation values. We also define a function t0 that translates initial activation values for ˆT into initial activation values for T by t0(σ, i, n) = f (σ, i, n). Now, we induct on the layers of T and ˆT . Assume that we have defined attention and acti- vation functions for ˆT for layers before k (where the initial activation values are treated as ‘‘layer 0’’), and a translation function tk−1 that trans- lates all possible activation values for ˆT from the previous layer into activation values for T from the previous layer. To define the attention function for ˆT for layer k for head h, we enumer- ate all the possible pairs ˆyi and ˆyj of activation values of ˆT at layer k − 1, and determine the corresponding attention values of T , which we denote by vk,h(ˆyi, ˆyj) = f att k,h(tk−1(ˆyi), tk−1(ˆyj)). We make a list of all the distinct resulting values and sort them into increasing order. Then we de- fine ˆf att k,h(ˆyi, ˆyj) to be the rank of vk,h(ˆyi, ˆyj) in this sorted list. The activation function for ˆT for layer k is, by definition, ˆf act k (y, b1, . . . , bH ) = (y, b1, . . . , bH ). The translation function for layer k is defined by tk(y, b1, . . . , bH ) k (tk−1(y), tk−1(b1), . . . , tk−1(bH )), = f act that is, we translate each of the component activa- tion values using tk−1 and then apply the activation function of T . Finally, the output function for ˆT is defined by ˆg(ˆy) = g(tK(ˆy)), that is, we translate the layer K activation value ˆy of ˆT to the layer K activation value of T , and apply the output function of T . By construction, ˆT is in informative normal form, and it has K layers and H heads. It is not difficult to see that for any input x, the translations tk(ˆy) of the activation values ˆy of ˆT are equal to the corresponding activation values of T , and the outputs ˆT (x) = T (x) are equal as well. Thus, L( ˆT ) = L(T ). To illustrate the construction of ˆT in the proof of Lemma 1, we briefly show how an informa- tive normal form version of the Transformer for PALINDROMES from Section 5 would process the input x = abcca$. Because the attention functions
in that example return 0 or 1, their translation
is simplified.
806
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
l
a
c
_
a
_
0
0
4
9
0
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
The initial activation values and layer 1 atten-
tion function are the same as in the example. The
resulting layer 1 activation sequence, consisting
of a sequence of paired initial activations and
attention values, is
((a, 1, 6), (a, 5, 6)),
((b, 2, 6), (c, 4, 6)), . . . , (($, 6, 6), ($, 6, 6)).
The translation t1 maps ((xi, i, n), (xj, j, n)) to
(cid:15)= xj. When
(0, i) if xi = xj and (1, i) if xi
applied to the above activation sequence, this
yields the previous example’s layer 1 activation
sequence.
The layer 2 attention function applied to a pair
of layer 1 activation values ((xi, i, n), (xj, j, n))
and ((xk, k, n), (x(cid:2), (cid:2), n)) first applies the transla-
tion function t1 to these two activation values to
recover the pairs (r, i) and (s, j), and then applies
the example’s layer 2 attention function to these
to yield the attention value s.
The layer 2 translation function maps a layer 2
activation value
(((xi, i, n), (xj, j, n)), ((xk, k, n), (x(cid:2), (cid:2), n)))
to (i, k). For layer 2 and position 6 the activation
value for this input is
((($, 6, 6), ($, 6, 6)), ((b, 2, 6), (c, 4, 6))),
which is mapped to (6, 2) by t2. The previous
example’s output function compares 6 and 2 and
returns 0, rejecting the input x.
7 From GUHAT to Circuits
In this section we show that for every language
L ∈ GUHAT, we can construct a family of Bool-
ean circuits of constant depth and polynomial
size that also recognizes L. This will prove the
following, which is our main result.
Theorem 1. Every language in GUHAT is recog-
nized by a family of circuits in AC 0.
Let L be a language over Σ that is in GUHAT.
By Lemma 1, we may assume that L is recognized
by GUHAT Transformer T in informative normal
form. Assume T has K layers and H heads.
What we describe below is a family of circuits
to recognize the end-marked language L$, which can easily be converted to a family of circuits that recognizes L by hard-wiring the representation of the end-of-sequence symbol $ at the end of
the input string using constant gates. Let s =
(cid:2)(|Σ|+1) and let h be any binary symbol encoding
for Σ ∪ {$}. We construct a family of Boolean circuits {Cn} of constant depth and polynomial size such that for all positive integers n and all x ∈ Σn−1, x ∈ L if and only if Csn(h(x$)) = 1.
The key step of the proof is to bound the
number of bits needed to represent attention and
activation values for an input sequence of length
n by O(log n), where the suppressed constants
depend on K and H.
Lemma 2. Let T be a GUHAT in informative
normal form with K layers and H heads, and
alphabet Σ. Let s = (cid:2)(|Σ| + 1). Then for any input
of length n and any k ∈ [0..K], the activation
values at layer k can be represented by (H +
1)k(2(cid:2)(n) + s) bits, and for k ≥ 1, the attention
scores at layer k can be represented by 2(H +
1)k−1(2(cid:2)(n) + s) bits.
Proof. For an input sequence of length n, the
initial activation values are (σ, i, n), where σ ∈
Σ ∪ {$} and i ∈ [1..n]. This can be represented
by a string of 2(cid:2)(n) + s bits. At each successive
layer, the activation values are a tuple of (H + 1)
values from the previous layer, which multiplies
the number of bits required to represent them
by (H + 1). Also, the range of attention scores
is bounded by the number of ordered pairs of
activation values at the previous layer, so atten-
tion values can be represented by twice the num-
ber of bits to represent an activation value at the
previous layer.
It is worth observing that the bounds provided
by Lemma 2 do not hold in the case of AHAT.
Attention scores may be the result of the average
of an arbitrary subset of the possible inputs, which
means that there are exponentially more possible
activation values at each layer.
The following elementary facts about Boolean
circuits will be useful.
Lemma 3. An arbitrary Boolean function f of n
inputs and m outputs can be computed by a depth
3 circuit of size at most m(n2n + 2n + n).
Proof. Express each output zi of f as a disjunctive
normal form (DNF) formula of at most 2n terms,
each with at most n literals. Convert each DNF
formula to a circuit with one OR gate with inputs
from an AND gate for each term, each of whose
inputs is either an input to the function, or the
result of applying a NOT gate to an input. In each
807
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
l
a
c
_
a
_
0
0
4
9
0
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
such circuit, the OR gate has at most 2n input
wires, each AND gate has at most n input wires,
and each of at most n NOT gates has one input
wire, for a total size bounded by n2n + 2n + n.
The final circuit consists of these m separate cir-
cuits computing in parallel, and its size is at most
m times the size of each one. The longest pos-
sible path to an output from an input is through a
NOT, an AND, and the OR gate, for a depth of
at most 3.
Corollary 1. If a Boolean function f has at most
c log n inputs and at most d log n outputs, then it
may be computed by a Boolean circuit of depth
3 and size at most (d log n)(nc(c log n) + nc +
c log n).
With the O(log n) bound on the number of
bits to represent activation and attention values,
Lemma 2 yields circuits of constant depth and size
polynomial in n for the input, attention, activa-
tion, and output functions. Additional circuitry is
necessary to implement the comparison of atten-
tion scores and selection of the activation value to
attend to for each position, layer, and head.
We construct the overall circuit Csn according
to the layers of T , starting with the input function.
Let the inputs to T be xi for i ∈ [1..n]. The
inputs to Csn are xi,j for i ∈ [1..n] and j ∈ [1..s],
where xi,j are the bits of h(xi), representing the
binary encoding of input symbol xi. At layer 0
for position i, the value of y(0)
i = f (xi, i, n) =
(xi, i, n) is achieved by having the input wires xi,j
for j ∈ [1..s] followed by a sequence of constants
0 or 1 representing bin(i, n) and bin(n, n) for a
total of 2(cid:2)(n) + s wires representing the value
(xi, i, n).
i
Inducting on layers, we assume that for some
k ∈ [1..K] the circuit Csn has been constructed
to contain the wires representing all the activation
values y(k−1)
for i ∈ [1..n] at layer k − 1. The
portion of the circuit computing the representa-
tions of activation values at layer k is described
as follows. Fix a position i ∈ [1..n] and a head
h ∈ [1..H]. For each j ∈ [1..n], there is a circuit
Ai,j,k,h that has as input the wires for the activation
values y(k−1)
and as output, wires rep-
resenting the nonnegative integer attention score
ai,j,k,h in binary. Each of these circuits Ai,j,k,h
has 2(H + 1)k−1(2(cid:2)(n) + s) inputs and outputs by
Lemma 2, and therefore can be computed using
depth 3 and size polynomial in n, by Corollary 1.
and y(k−1)
j
i
All Hn2 such circuits for layer k operate in
parallel, for overall depth 3 and size polyno-
mial in n.
We next describe the circuit that implements the
pooling function f UHA. For each pair j, j(cid:13) ∈ [1..n],
there is a circuit Di,j,j(cid:13),k,h whose inputs are the
outputs of Ai,j,k,h and Ai,j(cid:13),k,h and whose output
is a single wire gi,j,j(cid:13),k,h with a value of 1 if
ai,j,k,h ≥ ai,j(cid:13),k,h and 0 otherwise. Because of
the bounds on the number of inputs and outputs,
each of these circuits can have depth 3 and size
polynomial in n by Corollary 1. These n2 circuits
all compute in parallel.3 Then for each position
j, whether j maximizes ai,j,k,h can be computed
by an AND gate whose inputs are gi,j,j(cid:13),k,h for all
j(cid:13) ∈ [1..n]. Let the output of this AND gate be
denoted mi,j,k,h. Then mi,j,k,h = 1 if and only if
the position j maximizes ai,j,k,h. This increases
the depth by 1.
For each j, an indicator zi,j,k,h is computed
by an AND gate whose inputs are mi,j,k,h and
NOT(mi,j(cid:13),k,h) for all j(cid:13) < j. Thus, zi,j,k,h = 1
if and only if j is the leftmost position that max-
imizes ai,j,k,h. This increases the depth by 2.
j
Finally, these indicator values are used to com-
bine the layer k − 1 activation values in a selection
circuit, yielding the representation of the activa-
tion value bi,k,h = y(k−1)
such that zi,j,k,h = 1.
In general, such a selection circuit takes as in-
put t selector bits z1, . . . , zt, where exactly one
zj = 1, and t input values w1, . . . , wt, where each
wr consists of S bits. It outputs S bits represent-
ing the selected wj (for which zj = 1). Letting
wr,s denote bit s of wr, the computation can be
described as vr,s = wr,s ∧ zr for r ∈ [1..t] and
s ∈ [1..S], which can be computed by one layer
of tS AND gates in parallel. Then the bits of the
(cid:9)
S
r=1 vr,s for s ∈ [1..S], which
output are us =
can be computed by one layer of S OR gates in
parallel. Thus, the selection circuit adds 2 to the
depth, and a polynomial in n to the size.
Because each activation function for a GUHAT
in informative normal form simply returns its
arguments, no further computation is needed for
the activation values. The representation of the
activation value y(k)
is just the sequence of wires
i
representing y(k−1)
followed by those representing
bi,k,1, through bi,k,H .
i
3In fact, comparison of two b-bit integers can be done
with a Boolean circuit of constant depth and size polynomial
in b, but that is not necessary for the present purpose.
808
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
l
a
c
_
a
_
0
0
4
9
0
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
To produce the output of the circuit, we note
that the representation of y(L)
has O(log n) bits
n
and the output of g is a single bit, so g can
be implemented by a Boolean circuit of constant
depth and size polynomial in n, by Corollary 1.
This concludes the proof of Theorem 1.
Furst et al. (1984) prove that the PARITY,
EQUALITY, and MAJORITY languages are not in
AC0, which immediately implies the following.
Corollary 2. GUHAT does not contain the lan-
guages PARITY, MAJORITY, or EQUALITY.
To see that the DYCK-k languages are also not
in AC0, we reduce from the EQUALITY language.
Corollary 3. For all k ≥ 1, the language DYCK-k
is not in AC0, and is therefore not in GUHAT.
Proof. It suffices to prove this for k = 1. Assume
that there is a family {Cn} of Boolean circuits in
AC0 that recognizes DYCK-1. We may assume that
the binary symbol encoding is h([) = 0 and h(]) =
1. We show how to use this to construct a family
{En} of Boolean circuits in AC0 that recognizes
the EQUALITY language, a contradiction.
En is constructed from C3n as follows. If the
inputs to En are x1, . . . , xn, then En consists of
C3n with its first n inputs set to the constant 0, its
middle n inputs set to x1, . . . , xn, and its last n
inputs set to the constant 1.
Let x be any element of {0, 1}n. If the number
of occurrences of 0 is not equal to the number
of occurrences of 1 in x, then the input to C3n
has unequal numbers of [ and ] symbols, which
is not in DYCK-1 and x is rejected. If the number
of occurrences of 0 is equal to the number of
occurrences of 1 in x, then in any prefix of the
input to C3n, the number of occurrences of ] is less
than or equal to the number of occurrences of [.
At the end of the input, the number of occurrences
of [ is equal to the number of occurrences of ], so
the input to C3n is in DYCK-1 and x is accepted.
Thus {En} is a family of Boolean circuits in
AC0 that recognizes the language EQUALITY,
a contradiction.
8 Discussion and Conclusions
We have defined formal language recognition by
the encoder portion of a Transformer network us-
ing generalized unique hard attention (GUHAT),
unique hard attention (UHAT), and averaging hard
attention (AHAT), and shown that languages in
UHAT and GUHAT are recognizable by constant
depth, polynomial size families of circuits, that
is, families of circuits in the complexity class
AC0. This strengthens the negative result of Hahn
(2020) that the languages PARITY and DYCK-k
are not in GUHAT, and provides a simpler and
more general proof. Combined with prior results
of P´erez et al. (2019) showing that the language
MAJORITY is in AHAT, or Bhattamishra et al.
(2020) showing that the language DYCK-1 is in
AHAT, this shows that AHAT contains languages
that are not in GUHAT or UHAT.
Many intriguing open questions remain. What
classical closure properties hold for the classes of
languages GUHAT, UHAT, and AHAT? Closure
under complement just requires complementing
the output function g, and closure under pairwise
union and intersection should be straightforward
using a parallel approach; but what about closure
under homomorphism, inverse homomorphism,
concatenation, or Kleene star? We briefly observe
that GUHAT and UHAT cannot be closed un-
der both Kleene star and concatentation lest they
contain all regular languages, including PARITY.
Existing formal models and indeed practical
implementations of Transformers vary in their
representation of position information, whether
as an absolute representation of position, a ratio
(e.g., position i in a sequence of length n as i/n),
through angle information (e.g., position i by the
pair (cos θi, sin θi) where θi = πi/2n), or as an
arbitrary learned embedding. In the UHAT and
AHAT models, the choice of positional encod-
ing can facilitate positional comparison (e.g., an
angle-based encoding allows for equality testing
via dot products) or make it uncomputable (e.g., if
positional encodings enumerate Turing machines
that halt on their own encodings). It remains to be
understood what effect such differences in posi-
tion representation have on the expressive power
of a model.
More generally, is it possible to prove that soft
attention, which we have not addressed here, is
strictly more powerful than even averaging hard
attention? Yao et al. (2021, Theorem B.3) present a
construction for a soft attention Transformer that
recognizes DYCK-k. This construction crucially
uses specialized encodings of position and layer
normalization, whose formal power remains to
be understood.
809
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
l
a
c
_
a
_
0
0
4
9
0
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Finally, given the success that Transformers
have had as models of natural language, it is
perhaps surprising that these models’ expressive
power seems to be best characterized (or at least
bounded) in terms of circuit complexity. Mathe-
matical explorations of natural language have most
commonly used the approach to language com-
plexity afforded by the Chomsky hierarchy and
its refinements, which is based on automata and
formal grammars. The apparent incomparability
of these approaches suggests that the exploration
of different types of Transformer models might
offer a new approach to the study of the formal
properties of natural language.
Acknowledgments
We thank the reviewers and the action editor for
their work in reviewing this paper.
References
Sanjeev Arora and Boaz Barak. 2009. Com-
putational Complexity: A Modern Approach.
Cambridge University Press, Cambridge,
United Kingdom. https://doi.org/10
.1017/CBO9780511804090
Linguistics, 8:156–171. https://doi.org
/10.1162/tacl_a_00306
approaches
Thang Luong, Hieu Pham, and Christopher D.
Manning. 2015. Effective
to
attention-based neural machine translation. In
Proceedings of the 2015 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1412–1421, Lisbon, Portugal. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D15-1166
William Merrill, Ashish Sabharwal, and Noah A.
are
Smith. 2022. Saturated transformers
constant-depth threshold circuits. Computing
Research Repository, arXiv:2106.16213v3 [cs].
Jorge P´erez,
Javier Marinkovi´c, and Pablo
Barcel´o. 2019. On the Turing completeness
of modern neural network architectures. In
ICLR 2019 Conference Track, New Orleans,
LA, USA. OpenReview.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in
Neural Information Processing Systems 30,
pages 5998–6008, Long Beach, CA, USA.
Curran Associates, Inc.
transformers
to recognize formal
Satwik Bhattamishra, Kabir Ahuja, and Navin
Goyal. 2020. On the ability and limitations
of
lan-
guages. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 7096–7116, On-
line. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.576
Merrick Furst, James B. Saxe, and Michael Sipser.
1984. Parity, circuits, and the polynomial-
time hierarchy. Mathematical Systems Theory,
17(1):13–27. https://doi.org/10.1007
/BF01744431
Michael Hahn. 2020. Theoretical limitations of
self-attention in neural sequence models. Trans-
actions of the Association for Computational
Shunyu
Yao,
Peng,
Binghui
Christos
Papadimitriou, and Karthik Narasimhan. 2021.
Self-attention networks can process bounded
hierarchical languages. In Proceedings of the
59th Annual Meeting of the Association for
Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Lan-
guage Processing, volume 1: Long Papers,
pages 3770–3785, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.acl-long.292
Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh
Rawat, Sashank Reddi, and Sanjiv Kumar.
2020. Are Transformers universal approx-
imators of sequence-to-sequence functions?
In ICLR 2020 Conference Track, Online.
OpenReview.
810
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
9
0
2
0
3
7
1
2
4
/
/
t
l
a
c
_
a
_
0
0
4
9
0
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
下载pdf