Formal Language Recognition by Hard Attention Transformers:
Perspectives from Circuit Complexity

Yiding Hao, Dana Angluin, and Robert Frank
Yale University
New Haven, CT, USA
{yiding.hao, robert.frank, dana.angluin}@yale.edu

Abstract

This paper analyzes three formal models of Transformer encoders that differ in the form of their self-attention mechanism: unique hard attention (UHAT); generalized unique hard attention (GUHAT), which generalizes UHAT; and averaging hard attention (AHAT). We show that UHAT and GUHAT Transformers, viewed as string acceptors, can only recognize formal languages in the complexity class AC0, the class of languages recognizable by families of Boolean circuits of constant depth and polynomial size. This upper bound subsumes Hahn's (2020) results that GUHAT cannot recognize the DYCK languages or the PARITY language, since those languages are outside AC0 (Furst et al., 1984). In contrast, the non-AC0 languages MAJORITY and DYCK-1 are recognizable by AHAT networks, implying that AHAT can recognize languages that UHAT and GUHAT cannot.

1 Introduction

The Transformer architecture for neural networks (Vaswani et al., 2017) has yielded remarkable advances in performance on a variety of benchmark tasks in natural language processing. These advances have spurred considerable interest in understanding the capabilities and limitations of the Transformer architecture. While Transformer networks are extremely complex when deployed at scale, theoretical studies such as those of Pérez et al. (2019), Yun et al. (2020), Hahn (2020), and Merrill et al. (2022) have uncovered meaningful insights about the expressive power of Transformers by formulating abstract models of the self-attention mechanism and analyzing their computational power.

In this work, we analyze three restricted models of self-attention based on their ability to recognize formal languages. All three models use hard attention, meaning that each attention head attends only to the position or positions with the highest attention score, with no attention paid to any other position; they differ, however, in how they behave in the case of ties in the maximum attention value. In the first two models we study, the attention mechanism returns the value at exactly one position (for example, the leftmost) in case several positions tie for the maximum attention value. The first such model, which we call generalized unique hard attention Transformers (GUHAT) and which was defined by Hahn (2020), imposes no restrictions on the nature of activation values or the functions the network uses to compute them. The second model, unique hard attention Transformers (UHAT), was defined and studied by Yao et al. (2021) and is a more concrete version of GUHAT that incorporates restrictions on the nature of activation values and computations. In the third model, which we call averaging hard attention Transformers (AHAT), the attention mechanism returns the uniform average of the values at positions with the maximum attention value. This is the definition of hard attention used by Pérez et al. (2019), Yun et al. (2020), and Merrill et al. (2022).1

1 Merrill et al. (2022) call it saturated hard attention.

Our main contribution is to prove that GUHAT and UHAT can only recognize formal languages in AC0, the class of formal languages recognized by a family of Boolean circuits of constant depth and polynomial size, whereas AHAT can recognize formal languages outside of AC0. More formally, we prove that any formal language recognized by a GUHAT is also recognized by a family of Boolean circuits of constant depth and polynomial size, establishing AC0 as an upper bound on the expressiveness of UHAT and GUHAT. We also show that every UHAT can be simulated by an AHAT, establishing UHAT as a subclass of AHAT. Based on the classical results of Furst et al. (1984), our upper bound subsumes Hahn's (2020) results that GUHAT cannot recognize the DYCK languages or the PARITY language, neither of which belongs to AC0.



Furthermore, our result combines with Pérez et al.'s (2019) AHAT implementation of the MAJORITY language and Bhattamishra et al.'s (2020) AHAT implementation of DYCK-1 (neither of which is in AC0) to show that AHAT can recognize languages that GUHAT cannot. Recently, Merrill et al. (2022) have given an upper bound on the power of AHAT: namely, that every formal language recognizable using averaging hard attention is recognizable using a family of circuits of constant depth and polynomial size with Boolean and majority gates, that is, a family of circuits in the complexity class TC0, known to be a strict superset of AC0. Taken together, our paper establishes the following relationships between the three models of hard-attention Transformers that we consider.

UHAT ⊆ GUHAT ⊆ AC0
UHAT ⊊ AHAT ⊈ AC0

2 Preliminaries

Let Σ be a fixed finite alphabet of symbols, and let $ be a distinct end-of-sequence symbol not in Σ. The set of strings of symbols over Σ of length n is denoted by Σ^n, and the set of all finite strings over Σ is denoted Σ*. A (formal) language over Σ is any subset of Σ*. The set of integers between i and j inclusive is denoted [i..j]. Define the function ℓ : N → N by ℓ(n) = ⌈log2(n + 1)⌉ for all n, and define bin(i, n) to be the binary representation of i as a string of length ℓ(n), for every i ∈ [1..n]. For example, bin(6, 30) = 00110. If P is a logical predicate, then {P} denotes the truth value of P; for example, {(x ≥ y) ∨ (z > 3)} = 1 when x ≥ y or z > 3, and is 0 otherwise.

3 Circuit Complexity

Our analysis of UHAT and GUHAT is carried out within the framework of circuit complexity, in which the complexity of a computational system is measured by the size, depth, and types of gates of a Boolean circuit implementing that system. In this section we review the basic concepts, definitions, and results of circuit complexity used in our analysis. A detailed overview is provided in Chapter 6 of Arora and Barak (2009).

3.1 Boolean Circuits

Boolean circuits are a formal model of computational systems based on logic gates. Roughly speaking, a Boolean circuit consists of binary-valued input and output layers, with feedforward connections2 to one another via intermediate gates that implement logical operations. We use the following definition of Boolean circuits.

Definition 1. A Boolean circuit with n inputs and m outputs is a labeled directed acyclic graph satisfying the following conditions. There are n distinguished input vertices labeled with the variables x1, x2, ..., xn. Each input vertex has fan-in 0. The rest of the vertices are gates, each having a label from Constant-0, Constant-1, NOT, AND, or OR. The Constant-0 and Constant-1 gates have fan-in 0, NOT gates have fan-in 1, and AND and OR gates have unbounded fan-in. Finally, the labels z1, z2, ..., zm are applied to some (not necessarily distinct) vertices; these are the outputs.

We refer to the edges of a Boolean circuit as wires. The size of a circuit is the number of wires it contains, and the depth of a circuit is the maximum length of a directed path of wires from an input vertex to an output. A Boolean circuit C computes a Boolean function from {0, 1}^n to {0, 1}^m; we denote its output on input x = x1x2...xn by C(x). Observe that a Boolean circuit has a fixed number of input vertices, and therefore can only take as input bit strings of a fixed length. We would like to define circuit computation for a map defined on all of {0, 1}*. To that end, we allow different circuits for inputs of different lengths.

Definition 2. A family of circuits is a sequence {Cn}, where for each integer n ≥ 0, Cn is a Boolean circuit with n inputs and one output. A map f from {0, 1}* to {0, 1} is computed by a family of circuits {Cn} if and only if for all n and all x ∈ {0, 1}^n, f(x) = Cn(x).

The class AC0 is defined by setting restrictions on the size and depth of circuits within a family of Boolean circuits.

Definition 3. A family of circuits is of constant depth if there exists a constant K such that the depth of Cn is bounded by K for all n. A family of circuits is of polynomial size if there exists a constant c such that the size of Cn is bounded by n^c + c for all n. The set AC0 is the set of families of Boolean circuits of both constant depth and polynomial size.

We relate formal languages to families of circuits by identifying languages L ⊆ Σ* with Boolean functions that classify strings as belonging to L or not. Formally speaking, let Σ be a finite alphabet. A binary symbol encoding of Σ is an injective map h from Σ to {0, 1}^s, where s = ℓ(|Σ|). Thus h maps each symbol to a distinct binary string of length s. We extend h to a homomorphism on strings from Σ*. We say that the circuit family {Cn} recognizes the language L over Σ if there is a binary symbol encoding h of Σ such that for every n and every x ∈ Σ^n, C_{sn}(h(x)) = 1 if and only if x ∈ L. With this definition of language recognition via Boolean circuits, we say that a language is in AC0 if and only if it is recognized by a family of Boolean circuits in AC0.

2 We consider only acyclic circuits.

3.2 Non-AC0 Languages

Having defined the class AC0, we present some examples of languages not belonging to this class. First, the following three languages were shown by Furst et al. (1984) to fall outside AC0.

Definition 4. We define the following languages over the alphabet {0, 1}. The language PARITY is the set of all strings containing an even number of 1s; MAJORITY is the set of strings with at least as many 1s as 0s; and EQUALITY is the set of strings with exactly as many 1s as 0s.

Additionally, we show later in this paper (Corollary 3) that DYCK-1 also falls outside AC0.

Definition 5. The language DYCK-k is the set of strings over an alphabet of k types of pairs of brackets that are correctly nested and matched. For example, DYCK-2 over the alphabet {(, ), [, ]} can be described by a context-free grammar with productions S → ε, S → (S), S → [S], and S → SS. The language DYCK-(k, D) is the set of strings in DYCK-k in which the depth of nesting of brackets never exceeds D. The language SHUFFLE-k is the shuffle (arbitrary interleaving) of strings from k versions of DYCK-1, each using a different type of bracket pair.

Finally, we define PALINDROMES, a language shown in Section 5 to be in GUHAT.

Definition 6. The language PALINDROMES is the set of strings equal to their reverses, which can be described by the context-free grammar with productions S → ε, S → σ, and S → σSσ for each alphabet symbol σ.
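To make these language definitions concrete, the following Python functions are straightforward reference recognizers for the languages above. They are purely illustrative (the names are ours) and play no role in the circuit or Transformer constructions that follow.

def parity(x: str) -> bool:
    # PARITY: an even number of 1s.
    return x.count("1") % 2 == 0

def majority(x: str) -> bool:
    # MAJORITY: at least as many 1s as 0s.
    return x.count("1") >= x.count("0")

def equality(x: str) -> bool:
    # EQUALITY: exactly as many 1s as 0s.
    return x.count("1") == x.count("0")

def dyck1(x: str) -> bool:
    # DYCK-1 over {"[", "]"}: balanced brackets whose nesting depth never goes negative.
    depth = 0
    for c in x:
        depth += 1 if c == "[" else -1
        if depth < 0:
            return False
    return depth == 0

def palindromes(x: str) -> bool:
    # PALINDROMES: the string equals its reverse.
    return x == x[::-1]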

4 Hard Attention Transformers

We now define the three kinds of hard attention Transformers studied in this paper: GUHAT, UHAT, and AHAT. These formalisms are models of computation inspired by the encoder portion of the Transformer architecture. They conceptualize Transformers as cascading layers of feature extractors that convert a sequence of embeddings into increasingly higher-level representations.

4.1 General Framework

We begin by presenting a general framework that subsumes the three hard attention Transformer models. Formally, a generalized Transformer is a device that maps a string x ∈ Σ*$ to 1 or 0, signifying that x is accepted or rejected, respectively. Each such device is parameterized by a collection of functions described as follows.

Definition 7. A generalized Transformer with K layers and H attention heads is a tuple T = (Σ, A, f, f^att_{k,h}, f^pool, f^act_k, g | k ∈ [1..K], h ∈ [1..H]) where

• Σ is the input alphabet,

• A is the set of activation values,

• f : Σ ∪ {$} × N × N → A is the input function,

• f^att_{k,h} : A × A → R is the attention function for layer k and head h,

• f^pool : A* × R* → A is the pooling function,

• f^act_k : A^{H+1} → A is the activation function for layer k, and

• g : A → {0, 1} is the model output function.

On input x1x2...xn where xn = $, a string y^(0)_1 y^(0)_2 ... y^(0)_n ∈ A^n of initial activation values is given by

y^(0)_i = f(x_i, i, n)

for all i. Each layer k then produces a string y^(k) = y^(k)_1 y^(k)_2 ... y^(k)_n ∈ A^n of activation values from the previous activation values y^(k−1) = y^(k−1)_1 y^(k−1)_2 ... y^(k−1)_n as follows. First, each attention head h produces an n × n matrix of attention scores a_{i,j,k,h} given by

a_{i,j,k,h} = f^att_{k,h}(y^(k−1)_i, y^(k−1)_j)

for all positions i, j of the input string. Next, the pooling function converts each row of attention scores into an activation value based on y^(k−1):

b_{i,k,h} = f^pool(y^(k−1)_1, y^(k−1)_2, ..., y^(k−1)_n, a_{i,1,k,h}, a_{i,2,k,h}, ..., a_{i,n,k,h}).

Finally, the layer output y^(k) is computed using the layer's activation function:

y^(k)_i = f^act_k(y^(k−1)_i, b_{i,k,1}, ..., b_{i,k,H}).

When y^(k) has been computed for all k ∈ [1..K], the final output of the generalized Transformer T(x) is computed by applying the model output function to the last symbol of y^(K); that is,

T(x) = g(y^(K)_n).

If T(x) = 1, we say that T accepts x; otherwise, we say that T rejects x. The language recognized by T, denoted L(T), is the set of strings x ∈ Σ* such that T accepts x$.

The formalism we have presented above is fully generalized in the sense that we have placed no restrictions on the activation values A or the functions f, f^att_{k,h}, f^pool, f^act_k, or g, other than to specify their domains and co-domains. The three hard attention Transformer models are derived by placing restrictions upon these elements.

4.2 Unique and Averaging Hard Attention

The first restriction we consider is on the form of the pooling function. We consider two types of pooling functions: the unique hard attention function, used in GUHAT and UHAT, and the averaging hard attention function, used in AHAT. In unique hard attention, the pooling function simply selects the activation value from the previous layer corresponding to the argmax of the row of attention scores. In case of a tie, the leftmost activation value is selected.

Definition 8. The unique hard attention function is the pooling function f^UHA : A* × R* → A defined as follows. On inputs (y1, y2, ..., yn) ∈ A^n and (a1, a2, ..., an) ∈ R^n, let j ∈ [1..n] be the least position that maximizes a_j. Then,

f^UHA(y1, ..., yn, a1, ..., an) = y_j.

Averaging hard attention is similar to unique hard attention, except that in the case of a tie, the selected activation values are averaged.

Definition 9. Let A be a vector space over a field containing Q. The averaging hard attention function is the pooling function f^AHA : A* × R* → A defined as follows. On inputs (y1, y2, ..., yn) ∈ A^n and (a1, a2, ..., an) ∈ R^n, let j1, j2, ..., jm ∈ [1..n] be all the positions that maximize a_j. Then,

f^AHA(y1, ..., yn, a1, ..., an) = (1/m) Σ_{i=1}^{m} y_{j_i}.

The GUHAT model is defined as the class of generalized Transformers that use unique hard attention.

Definition 10. A generalized unique hard attention Transformer is a generalized Transformer whose pooling function is f^UHA. We use the term GUHAT to refer to the class of generalized unique hard attention Transformers, and also to the class of languages they recognize.

The GUHAT model mostly follows the definitions of Hahn (2020). It is slightly generalized in allowing the input function f to depend on the input length n, and in allowing the activation function f^act_k to depend on the layer k, but these generalizations are immaterial. In particular, it is not necessary to assume that the input length n is provided to the input function: if the input function were f(σ, i) = (σ, i), the subsequent layer could direct attention at every position to position n (because it uniquely contains the end-of-sequence symbol $), at which point the value of n is available at every position.
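As an illustration of Definitions 7 through 9, the following Python sketch runs one forward pass of a generalized Transformer and implements the two pooling functions. It is a minimal rendering of the definitions above under our own naming (run_transformer, f_uha, f_aha), not an implementation of any particular model in the paper; the averaging version assumes tuple-valued (vector) activations, as AHAT does.

def f_uha(ys, scores):
    # Unique hard attention (Definition 8): value at the leftmost argmax of the scores.
    return ys[scores.index(max(scores))]

def f_aha(ys, scores):
    # Averaging hard attention (Definition 9): uniform average over all argmax positions.
    m = max(scores)
    winners = [y for y, a in zip(ys, scores) if a == m]
    return tuple(sum(c) / len(winners) for c in zip(*winners))

def run_transformer(x, f_in, f_att, f_pool, f_act, g, K, H):
    # x is the input string already terminated by "$"; positions are 1-based as in Definition 7.
    n = len(x)
    y = [f_in(x[i], i + 1, n) for i in range(n)]                       # layer 0 activations
    for k in range(1, K + 1):
        new_y = []
        for i in range(n):
            # attention scores a_{i,j,k,h}; f_att takes (k, h) explicitly in this sketch
            scores = [[f_att(k, h, y[i], y[j]) for j in range(n)] for h in range(1, H + 1)]
            b = [f_pool(y, scores[h]) for h in range(H)]               # one pooled value per head
            new_y.append(f_act(k, y[i], b))
        y = new_y
    return g(y[n - 1])                                                 # output read at the $ position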

4.3 Restricted Models: UHAT and AHAT

GUHAT allows the activation values A and the functions f, f^att_{k,h}, f^act_k, and g to take on any arbitrary mathematical value. In practical applications of Transformer networks, however, these components are restricted in specific ways. Many variations of hard attention Transformers attempt to incorporate these restrictions into theoretical models, though they do not entirely agree on the details of these restrictions.

The UHAT and AHAT models adopt many of these restrictions, largely following the definitions of Yao et al. (2021). For the sake of computability, we require activation values to be vectors of rational numbers. Following Pérez et al. (2019), we restrict scalars to be rational as well. Next, we assume that the input function f is decomposed into a token embedding function and a position embedding function. Mirroring the more familiar description of attention functions in terms of query, key, and value matrices, we use the bilinear form for attention functions proposed by Luong et al. (2015). In addition to the unique and averaging hard attention mechanisms, we allow the pooling function to be future-masked (where for position i only those positions j with j ≤ i are considered in the attention computation) or past-masked (similarly for j ≥ i). Finally, we assume that activation functions and the model output function are computed by feedforward neural networks with ReLU activation. These restrictions are summarized below.

Definition 11. For d ∈ N, a restricted Transformer of dimension d is a generalized Transformer such that

• the set of activation values is A = Q^d;

• the input function is given by

f(σ, i, n) = f_e(σ) + p(i, n),

where f_e : Σ ∪ {$} → Q^d is the token embedding function and p : N × N → Q^d is the position embedding function;

• each attention function is of the form

f^att_{k,h}(y, y′) = y^T A_{k,h} y′,

where A_{k,h} ∈ Q^{d×d};

• the pooling function may be future-masked or past-masked;

• each activation function is computed by a feedforward neural network with ReLU activation;

• the output function g is computed by a feedforward neural network with ReLU activation followed by a softmax layer, with g(y) = 1 if and only if the output of the network on input y is greater than or equal to 1/2.

Because Σ is finite, we may assume that the token embedding function is given by a table lookup. Our formulation of position embedding is somewhat more general than the definition of Yao et al. (2021), who take the position embedding to be a scalar defined as p(i, n) = i/n that occupies one position of the initial activation vector.

The UHAT and AHAT models are defined to be restricted Transformers that satisfy the above conditions and use unique and averaging hard attention, respectively.

Definition 12. A unique hard attention Transformer is a restricted Transformer whose pooling function is f^UHA or a future- or past-masked version thereof. An averaging hard attention Transformer is a restricted Transformer whose pooling function is f^AHA or a future- or past-masked version thereof. We use the terms UHAT and AHAT, respectively, for these classes of Transformers, and also for the classes of languages they recognize.

UHAT is clearly a subclass of GUHAT because the former imposes restrictions on the form of the input, attention, activation, and output functions. We suspect, but do not prove, that this inclusion is proper. Moreover, we briefly argue below that UHAT is properly contained in AHAT.

Proposition 1. UHAT is a strict subclass of AHAT.

Proof sketch. Because AHAT recognizes non-AC0 languages (Pérez et al., 2019; Bhattamishra et al., 2020), it suffices to show that UHAT ⊆ AHAT. Let T be a UHAT of dimension d recognizing L. We define a UHAT T̂ of dimension d + 2 recognizing L that has no ties in its attention values. Since the pooling functions f^UHA used in UHAT and f^AHA used in AHAT are identical in the absence of ties, replacing the pooling function of T̂ with f^AHA gives us an AHAT recognizing L.

Let N be a sufficiently large integer depending on n, specified below. Each activation value ŷ^(k)_i in T̂ is y^(k)_i from T with two additional constant components, set to 1 and i/N by the input function. The attention function f̂^att_{k,h}(ŷ^(k−1)_i, ŷ^(k−1)_j) computes a_{i,j,k,h} using the original attention function and activation values, subtracting the value j/N. This is achievable with a bilinear map.

If for j < ℓ the attention values a_{i,j,k,h} and a_{i,ℓ,k,h} are tied in T, then after subtracting j/N and ℓ/N respectively, the tie is broken in favor of j. However, N must also be large enough to preserve the order of any two attention values that are not tied. There are a finite number of different attention values that arise in the computation of T on all the inputs of length n, and it suffices to choose N so that n/N is less than the distance between any pair of such attention values.
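The two extra components in this proof sketch can be made concrete with a small amount of linear algebra. The following Python sketch (our own illustration; floats stand in for the rationals required by Definition 11) extends a bilinear attention matrix so that the score on the augmented activations equals the original score minus j/N.

import numpy as np

def untie(A, N):
    # Extend a d x d bilinear attention matrix A to (d+2) x (d+2) so that, on augmented
    # activations y_hat = (y, 1, i/N), the score becomes y_i^T A y_j - j/N, breaking ties
    # in favor of smaller j as in the proof sketch of Proposition 1.
    d = A.shape[0]
    A_hat = np.zeros((d + 2, d + 2))
    A_hat[:d, :d] = A
    A_hat[d, d + 1] = -1.0          # constant-1 slot of y_i times the j/N slot of y_j
    return A_hat

def augment(y, i, N):
    return np.concatenate([y, [1.0, i / N]])

# Tiny check of the identity y_hat_i^T A_hat y_hat_j == y_i^T A y_j - j/N:
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); N = 1000
yi, yj = rng.normal(size=3), rng.normal(size=3)
i, j = 2, 5
assert abs(augment(yi, i, N) @ untie(A, N) @ augment(yj, j, N) - (yi @ A @ yj - j / N)) < 1e-9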
4.4 Prior Results for These Models

Hahn (2020) shows that the languages 1* and {a^n b^n | n ≥ 1} are in GUHAT, and that the languages PARITY and DYCK-k for all k ≥ 1 are not in GUHAT. Pérez et al. (2019) show that even without positional information, the language MAJORITY is in AHAT. Bhattamishra et al. (2020) show that SHUFFLE-k is in AHAT, which implies that DYCK-1 is in AHAT. Yao et al. (2021) show that the language DYCK-(k, D) is in UHAT. The latter two results use positional masking, but no other positional information.

5 PALINDROMES in GUHAT

Let us now illustrate how a GUHAT computes by way of example. In this section, we describe a GUHAT with 2 layers and 1 head that recognizes the language PALINDROMES over the alphabet Σ = {a, b, c}. Broadly speaking, this Transformer works as follows. The first layer is responsible for comparing each symbol of the input string with the corresponding symbol on the opposite side of the string, and marking whether the two symbols match. The second layer reads these markings, searching for a mismatch identified by the first layer. If one is found, the model output function returns 0; otherwise, it returns 1. For intuition, we simultaneously illustrate the Transformer's computation on the input abcca$, which should be rejected.

The input function is defined as f(σ, i, n) = (σ, i, n) for each σ ∈ Σ and i ∈ [1..n]. For our example input, the initial (layer 0) activation values are shown in the first row of Figure 1. These activation values are not rational-valued vectors, of course, but the GUHAT model imposes no restriction on the form these values can take.

[Figure 1: Activation values computed by a GUHAT Transformer for PALINDROMES as it rejects the input abcca$.]

We define the attention function for layer 1, f^att_{1,1}((σ, i, n), (σ′, j, n)), to be {(j = n − i) ∨ (i = j = n)}. For each position i < n, this selects the activation at the correct corresponding position, n − i. For position n, it selects the activation at position n. We define the activation function for layer 1 as

f^act_1((x_i, i, n), (x_j, j, n)) = (1, i) if x_i ≠ x_j; (1, i) if i = j = n; (0, i) otherwise.

The layer 1 activation values for our example input are shown in the second row of Figure 1. This indicates that positions 2 and 4 found mismatched symbols, and positions 1, 3, and 5 did not.

Layer 2 gathers the relevant information from layer 1 into the last position. The layer 2 attention function is defined by f^att_{2,1}((r, i), (s, j)) = s. This directs the attention at every position to the leftmost activation value (s, j) from layer 1 such that s = 1. In our example, the leftmost such position is 2, with its activation of (1, 2). If the input sequence had instead been a valid palindrome, none of the positions i ∈ [1..n − 1] would have been marked with (1, i) by layer 1. In this case, the leftmost position with s = 1 would have been the final position n, which has the activation value (1, n). We define the layer 2 activation function as f^act_2((r, i), (s, j)) = (i, j). For our example input, the activation values for layer 2 are shown in the third row of Figure 1.

The activation value at position n will be (n, n) if and only if no earlier position found a symbol mismatch, so the model output function is simply g((i, j)) = {i = j}. With the input sequence abcca$, the activation for position 6 at layer 2 is (6, 2) and the output value is 0. For a valid palindrome such as abcba$, the activation for position 6 at layer 2 is (6, 6) and the output value is 1.

Generalizing this construction to an arbitrary alphabet Σ, we have the following.

Proposition 2. For any finite alphabet Σ, the language PALINDROMES over Σ is in GUHAT.
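The 2-layer construction above can be simulated directly. The following Python sketch is our own illustrative code, not part of the paper's formal development; it mirrors the layer 1 and layer 2 functions and reproduces the accept/reject behavior on abcca$ and abcba$.

def guhat_palindromes(x):
    # x includes the trailing "$"; positions are 1-based as in the paper.
    n = len(x)
    y0 = [(x[i - 1], i, n) for i in range(1, n + 1)]                 # layer 0: (symbol, i, n)

    def attend_leftmost(scores):                                     # unique hard attention
        return scores.index(max(scores))

    # Layer 1: each position i attends to position n - i (position n attends to itself).
    y1 = []
    for (sig_i, i, _) in y0:
        scores = [1 if (j == n - i) or (i == j == n) else 0 for (_, j, _) in y0]
        sig_j, j, _ = y0[attend_leftmost(scores)]
        y1.append((1, i) if (sig_i != sig_j or i == j == n) else (0, i))

    # Layer 2: every position attends to the leftmost layer-1 value whose flag s is 1.
    y2 = []
    for (_, i) in y1:
        scores = [s for (s, _) in y1]
        s, j = y1[attend_leftmost(scores)]
        y2.append((i, j))

    i, j = y2[-1]                                                    # read the output at position n
    return 1 if i == j else 0

print(guhat_palindromes("abcca$"), guhat_palindromes("abcba$"))      # expected: 0 1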
6 A Normal Form for GUHAT

Despite the abstractness and generality of the GUHAT model, we can define a normal form representation and show that every Transformer T in GUHAT is equivalent to a Transformer in GUHAT in this normal form with the same number of layers and heads. The key idea is to preserve in the activation values all the information from previous layers that has been used to compute them, by requiring that the input and activation functions simply return the tuple of their arguments. We also require that attention values be integers in the smallest relevant range.

Definition 13. A GUHAT with K layers and H heads is in informative normal form if and only if the following conditions are satisfied.

• The input function is f(σ, i, n) = (σ, i, n).

• For each layer k ∈ [1..K], the activation values are (H + 1)-tuples of activation values at layer k − 1, and the activation function is defined by f^act_k(y, b1, ..., bH) = (y, b1, ..., bH).

• For each layer k ∈ [1..K] and attention head h ∈ [1..H], the attention function f^att_{k,h} returns an integer in [0..N − 1], where N is the total number of possible ordered pairs of activation values at layer k − 1.

Lemma 1. For any Transformer T ∈ GUHAT, there exists a Transformer T̂ ∈ GUHAT in informative normal form such that L(T) = L(T̂). Moreover, T̂ has the same number of layers and heads as T.

Proof. Let T be a GUHAT with K layers and H heads, with input alphabet Σ, input function f, attention functions f^att_{k,h}, activation functions f^act_k, and output function g. We describe how to construct functions for an equivalent Transformer T̂ in GUHAT in informative normal form, which also has K layers and H heads. We assume that n is the input length.

For T̂ the input function f̂(σ, i, n) is defined to return the triple (σ, i, n). Note that there are at most |Σ|n possible initial activation values. We also define a function t0 that translates initial activation values for T̂ into initial activation values for T by t0(σ, i, n) = f(σ, i, n).

Now, we induct on the layers of T and T̂. Assume that we have defined attention and activation functions for T̂ for layers before k (where the initial activation values are treated as "layer 0"), and a translation function t_{k−1} that translates all possible activation values for T̂ from the previous layer into activation values for T from the previous layer. To define the attention function for T̂ for layer k and head h, we enumerate all the possible pairs ŷ_i and ŷ_j of activation values of T̂ at layer k − 1, and determine the corresponding attention values of T, which we denote by v_{k,h}(ŷ_i, ŷ_j) = f^att_{k,h}(t_{k−1}(ŷ_i), t_{k−1}(ŷ_j)). We make a list of all the distinct resulting values and sort them into increasing order. Then we define f̂^att_{k,h}(ŷ_i, ŷ_j) to be the rank of v_{k,h}(ŷ_i, ŷ_j) in this sorted list. The activation function for T̂ for layer k is, by definition, f̂^act_k(y, b1, ..., bH) = (y, b1, ..., bH). The translation function for layer k is defined by

t_k(y, b1, ..., bH) = f^act_k(t_{k−1}(y), t_{k−1}(b1), ..., t_{k−1}(bH)),

that is, we translate each of the component activation values using t_{k−1} and then apply the activation function of T. Finally, the output function for T̂ is defined by ĝ(ŷ) = g(t_K(ŷ)), that is, we translate the layer K activation value ŷ of T̂ to the layer K activation value of T, and apply the output function of T.

By construction, T̂ is in informative normal form, and it has K layers and H heads. It is not difficult to see that for any input x, the translations t_k(ŷ) of the activation values ŷ of T̂ are equal to the corresponding activation values of T, and the outputs T̂(x) = T(x) are equal as well. Thus, L(T̂) = L(T).

To illustrate the construction of T̂ in the proof of Lemma 1, we briefly show how an informative normal form version of the Transformer for PALINDROMES from Section 5 would process the input x = abcca$. Because the attention functions in that example return 0 or 1, their translation is simplified.

The initial activation values and the layer 1 attention function are the same as in the example. The resulting layer 1 activation sequence, consisting of paired initial activations and attended values, is

((a, 1, 6), (a, 5, 6)), ((b, 2, 6), (c, 4, 6)), ..., (($, 6, 6), ($, 6, 6)).

The translation t1 maps ((x_i, i, n), (x_j, j, n)) to (0, i) if x_i = x_j and to (1, i) if x_i ≠ x_j. When applied to the above activation sequence, this yields the previous example's layer 1 activation sequence. The layer 2 attention function applied to a pair of layer 1 activation values ((x_i, i, n), (x_j, j, n)) and ((x_k, k, n), (x_ℓ, ℓ, n)) first applies the translation function t1 to these two activation values to recover the pairs (r, i) and (s, k), and then applies the example's layer 2 attention function to these to yield the attention value s. The layer 2 translation function maps a layer 2 activation value (((x_i, i, n), (x_j, j, n)), ((x_k, k, n), (x_ℓ, ℓ, n))) to (i, k). For layer 2 and position 6 the activation value for this input is ((($, 6, 6), ($, 6, 6)), ((b, 2, 6), (c, 4, 6))), which is mapped to (6, 2) by t2. The previous example's output function compares 6 and 2 and returns 0, rejecting the input x.
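A minimal sketch of the rank step in Lemma 1's construction, under the assumption that the previous layer's possible activation values can be enumerated explicitly; the helper names below are ours.

def rank_attention(prev_values, f_att, translate):
    # Enumerate every ordered pair of possible layer (k-1) activation values of T-hat,
    # compute T's attention value on their translations, and replace each value by its
    # rank among the distinct values: an integer in [0..N-1], as Definition 13 requires.
    v = {(y_i, y_j): f_att(translate(y_i), translate(y_j))
         for y_i in prev_values for y_j in prev_values}
    order = {a: r for r, a in enumerate(sorted(set(v.values())))}
    return lambda y_i, y_j: order[v[(y_i, y_j)]]

Because ranking preserves the relative order of the scores, the position selected by unique hard attention is unchanged.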
7 From GUHAT to Circuits

In this section we show that for every language L ∈ GUHAT, we can construct a family of Boolean circuits of constant depth and polynomial size that also recognizes L. This will prove the following, which is our main result.

Theorem 1. Every language in GUHAT is recognized by a family of circuits in AC0.

Let L be a language over Σ that is in GUHAT. By Lemma 1, we may assume that L is recognized by a GUHAT Transformer T in informative normal form. Assume T has K layers and H heads. What we describe below is a family of circuits that recognizes the end-marked language L$, which can easily be converted to a family of circuits that recognizes L by hard-wiring the representation of the end-of-sequence symbol $ at the end of the input string using constant gates. Let s = ℓ(|Σ| + 1) and let h be any binary symbol encoding for Σ ∪ {$}. We construct a family of Boolean circuits {Cn} of constant depth and polynomial size such that for all positive integers n and all x ∈ Σ^{n−1}, x ∈ L if and only if C_{sn}(h(x$)) = 1.

The key step of the proof is to bound the number of bits needed to represent attention and activation values for an input sequence of length n by O(log n), where the suppressed constants depend on K and H.

Lemma 2. Let T be a GUHAT in informative normal form with K layers and H heads, and alphabet Σ. Let s = ℓ(|Σ| + 1). Then for any input of length n and any k ∈ [0..K], the activation values at layer k can be represented by (H + 1)^k (2ℓ(n) + s) bits, and for k ≥ 1, the attention scores at layer k can be represented by 2(H + 1)^{k−1} (2ℓ(n) + s) bits.

Proof. For an input sequence of length n, the initial activation values are (σ, i, n), where σ ∈ Σ ∪ {$} and i ∈ [1..n]. This can be represented by a string of 2ℓ(n) + s bits. At each successive layer, the activation values are a tuple of (H + 1) values from the previous layer, which multiplies the number of bits required to represent them by (H + 1). Also, the range of attention scores is bounded by the number of ordered pairs of activation values at the previous layer, so attention scores can be represented by twice the number of bits needed to represent an activation value at the previous layer.

It is worth observing that the bounds provided by Lemma 2 do not hold in the case of AHAT. In AHAT, a pooled value may be the average of the values at an arbitrary subset of positions, which means that there are exponentially more possible activation values at each layer.

The following elementary facts about Boolean circuits will be useful.

Lemma 3. An arbitrary Boolean function f of n inputs and m outputs can be computed by a depth 3 circuit of size at most m(n2^n + 2^n + n).

Proof. Express each output z_i of f as a disjunctive normal form (DNF) formula of at most 2^n terms, each with at most n literals. Convert each DNF formula to a circuit with one OR gate with inputs from an AND gate for each term, each of whose inputs is either an input to the function or the result of applying a NOT gate to an input. In each such circuit, the OR gate has at most 2^n input wires, each AND gate has at most n input wires, and each of at most n NOT gates has one input wire, for a total size bounded by n2^n + 2^n + n. The final circuit consists of these m separate circuits computing in parallel, and its size is at most m times the size of each one. The longest possible path to an output from an input is through a NOT, an AND, and the OR gate, for a depth of at most 3.

Corollary 1. If a Boolean function f has at most c log n inputs and at most d log n outputs, then it may be computed by a Boolean circuit of depth 3 and size at most (d log n)(n^c (c log n) + n^c + c log n).
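As a concrete reading of Lemma 3, the sketch below (illustrative Python of our own, with NOT gates folded into the literal test) builds the DNF-style depth-3 evaluator for an arbitrary Boolean function given as a predicate: one AND term per satisfying assignment, all feeding a single OR gate.

from itertools import product

def dnf_depth3(f, n):
    # f: a Python predicate on n-bit tuples standing in for an arbitrary Boolean function.
    # Returns (terms, evaluate): each term is the literal list of one AND gate, and
    # evaluate ORs the terms, mirroring the depth-3 circuit of Lemma 3.
    terms = [[(i, a[i]) for i in range(n)]
             for a in product((0, 1), repeat=n) if f(a)]
    def evaluate(x):
        return int(any(all(x[i] == b for i, b in t) for t in terms))
    return terms, evaluate

# Example: accepting 3-bit strings with an even number of 1s yields 4 satisfying
# assignments, hence 4 AND gates of fan-in 3 under one OR gate.
terms, C = dnf_depth3(lambda a: sum(a) % 2 == 0, 3)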
With the O(log n) bound on the number of bits needed to represent activation and attention values, Lemma 2 yields circuits of constant depth and size polynomial in n for the input, attention, activation, and output functions. Additional circuitry is necessary to implement the comparison of attention scores and the selection of the activation value to attend to for each position, layer, and head.

We construct the overall circuit C_{sn} according to the layers of T, starting with the input function. Let the inputs to T be x_i for i ∈ [1..n]. The inputs to C_{sn} are x_{i,j} for i ∈ [1..n] and j ∈ [1..s], where x_{i,j} are the bits of h(x_i), representing the binary encoding of input symbol x_i. At layer 0 for position i, the value of y^(0)_i = f(x_i, i, n) = (x_i, i, n) is achieved by having the input wires x_{i,j} for j ∈ [1..s] followed by a sequence of constants 0 or 1 representing bin(i, n) and bin(n, n), for a total of 2ℓ(n) + s wires representing the value (x_i, i, n).

Inducting on layers, we assume that for some k ∈ [1..K] the circuit C_{sn} has been constructed to contain the wires representing all the activation values y^(k−1)_i for i ∈ [1..n] at layer k − 1. The portion of the circuit computing the representations of activation values at layer k is described as follows. Fix a position i ∈ [1..n] and a head h ∈ [1..H]. For each j ∈ [1..n], there is a circuit A_{i,j,k,h} that has as input the wires for the activation values y^(k−1)_i and y^(k−1)_j and as output, wires representing the nonnegative integer attention score a_{i,j,k,h} in binary. Each of these circuits A_{i,j,k,h} has 2(H + 1)^{k−1}(2ℓ(n) + s) inputs and outputs by Lemma 2, and therefore can be computed using depth 3 and size polynomial in n, by Corollary 1. All Hn^2 such circuits for layer k operate in parallel, for overall depth 3 and size polynomial in n.

We next describe the circuit that implements the pooling function f^UHA. For each pair j, j′ ∈ [1..n], there is a circuit D_{i,j,j′,k,h} whose inputs are the outputs of A_{i,j,k,h} and A_{i,j′,k,h} and whose output is a single wire g_{i,j,j′,k,h} with a value of 1 if a_{i,j,k,h} ≥ a_{i,j′,k,h} and 0 otherwise. Because of the bounds on the number of inputs and outputs, each of these circuits can have depth 3 and size polynomial in n by Corollary 1. These n^2 circuits all compute in parallel.3 Then for each position j, whether j maximizes a_{i,j,k,h} can be computed by an AND gate whose inputs are g_{i,j,j′,k,h} for all j′ ∈ [1..n]. Let the output of this AND gate be denoted m_{i,j,k,h}. Then m_{i,j,k,h} = 1 if and only if the position j maximizes a_{i,j,k,h}. This increases the depth by 1. For each j, an indicator z_{i,j,k,h} is computed by an AND gate whose inputs are m_{i,j,k,h} and NOT(m_{i,j′,k,h}) for all j′ < j. Thus, z_{i,j,k,h} = 1 if and only if j is the leftmost position that maximizes a_{i,j,k,h}. This increases the depth by 2.

Finally, these indicator values are used to combine the layer k − 1 activation values in a selection circuit, yielding the representation of the activation value b_{i,k,h} = y^(k−1)_j such that z_{i,j,k,h} = 1. In general, such a selection circuit takes as input t selector bits z1, ..., zt, where exactly one z_j = 1, and t input values w1, ..., wt, where each w_r consists of S bits. It outputs the S bits representing the selected w_j (for which z_j = 1). Letting w_{r,s} denote bit s of w_r, the computation can be described as v_{r,s} = w_{r,s} ∧ z_r for r ∈ [1..t] and s ∈ [1..S], which can be computed by one layer of tS AND gates in parallel. Then the bits of the output are u_s = ∨_{r=1}^{t} v_{r,s} for s ∈ [1..S], which can be computed by one layer of S OR gates in parallel. Thus, the selection circuit adds 2 to the depth, and a polynomial in n to the size.

Because each activation function for a GUHAT in informative normal form simply returns its arguments, no further computation is needed for the activation values. The representation of the activation value y^(k)_i is just the sequence of wires representing y^(k−1)_i followed by those representing b_{i,k,1} through b_{i,k,H}.

3 In fact, comparison of two b-bit integers can be done with a Boolean circuit of constant depth and size polynomial in b, but that is not necessary for the present purpose.
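The selection circuit described above is small enough to spell out. The following Python sketch (ours, for illustration only) mimics its two gate layers on concrete bit vectors.

def select(z, w):
    # z: selector bits z_1..z_t with exactly one 1; w: t values, each a list of S bits.
    # AND layer: v[r][s] = w[r][s] AND z[r]; OR layer: u[s] = OR over r of v[r][s].
    t, S = len(w), len(w[0])
    v = [[w[r][s] & z[r] for s in range(S)] for r in range(t)]
    return [int(any(v[r][s] for r in range(t))) for s in range(S)]

# Selecting the third of four 4-bit values:
print(select([0, 0, 1, 0], [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 0, 1]]))
# expected: [1, 0, 1, 1]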
To produce the output of the circuit, we note that the representation of y^(K)_n has O(log n) bits and the output of g is a single bit, so g can be implemented by a Boolean circuit of constant depth and size polynomial in n, by Corollary 1. This concludes the proof of Theorem 1.

Furst et al. (1984) prove that the PARITY, EQUALITY, and MAJORITY languages are not in AC0, which immediately implies the following.

Corollary 2. GUHAT does not contain the languages PARITY, MAJORITY, or EQUALITY.

To see that the DYCK-k languages are also not in AC0, we reduce from the EQUALITY language.

Corollary 3. For all k ≥ 1, the language DYCK-k is not in AC0, and is therefore not in GUHAT.

Proof. It suffices to prove this for k = 1. Assume that there is a family {Cn} of Boolean circuits in AC0 that recognizes DYCK-1. We may assume that the binary symbol encoding is h([) = 0 and h(]) = 1. We show how to use this to construct a family {En} of Boolean circuits in AC0 that recognizes the EQUALITY language, a contradiction.

En is constructed from C_{3n} as follows. If the inputs to En are x1, ..., xn, then En consists of C_{3n} with its first n inputs set to the constant 0, its middle n inputs set to x1, ..., xn, and its last n inputs set to the constant 1. Let x be any element of {0, 1}^n. If the number of occurrences of 0 is not equal to the number of occurrences of 1 in x, then the input to C_{3n} has unequal numbers of [ and ] symbols, so it is not in DYCK-1 and x is rejected. If the number of occurrences of 0 is equal to the number of occurrences of 1 in x, then in any prefix of the input to C_{3n}, the number of occurrences of ] is less than or equal to the number of occurrences of [. At the end of the input, the number of occurrences of [ is equal to the number of occurrences of ], so the input to C_{3n} is in DYCK-1 and x is accepted. Thus {En} is a family of Boolean circuits in AC0 that recognizes the language EQUALITY, a contradiction.
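The padding trick in this proof is easy to check empirically. The sketch below (our own illustration) maps a binary string x to [^n · x′ · ]^n, where x′ replaces 0 by [ and 1 by ], and confirms that the result is in DYCK-1 exactly when x is in EQUALITY.

def dyck1(w):
    depth = 0
    for c in w:
        depth += 1 if c == "[" else -1
        if depth < 0:
            return False
    return depth == 0

def reduce_equality(x):
    # Build the padded input fed to C_3n in the proof: n copies of "[", then x with
    # 0 -> "[" and 1 -> "]", then n copies of "]".
    n = len(x)
    return "[" * n + x.replace("0", "[").replace("1", "]") + "]" * n

for x in ["0101", "0011", "0111", "1100"]:
    assert dyck1(reduce_equality(x)) == (x.count("0") == x.count("1"))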
8 Discussion and Conclusions

We have defined formal language recognition by the encoder portion of a Transformer network using generalized unique hard attention (GUHAT), unique hard attention (UHAT), and averaging hard attention (AHAT), and shown that languages in UHAT and GUHAT are recognizable by constant depth, polynomial size families of circuits, that is, families of circuits in the complexity class AC0. This strengthens the negative result of Hahn (2020) that the languages PARITY and DYCK-k are not in GUHAT, and provides a simpler and more general proof. Combined with the prior result of Pérez et al. (2019) showing that the language MAJORITY is in AHAT, or that of Bhattamishra et al. (2020) showing that the language DYCK-1 is in AHAT, this shows that AHAT contains languages that are not in GUHAT or UHAT.

Many intriguing open questions remain. What classical closure properties hold for the classes of languages GUHAT, UHAT, and AHAT? Closure under complement just requires complementing the output function g, and closure under pairwise union and intersection should be straightforward using a parallel approach; but what about closure under homomorphism, inverse homomorphism, concatenation, or Kleene star? We briefly observe that GUHAT and UHAT cannot be closed under both Kleene star and concatenation, lest they contain all regular languages, including PARITY.

Existing formal models, and indeed practical implementations of Transformers, vary in their representation of position information, whether as an absolute representation of position, a ratio (e.g., position i in a sequence of length n as i/n), through angle information (e.g., position i by the pair (cos θ_i, sin θ_i) where θ_i = πi/2n), or as an arbitrary learned embedding. In the UHAT and AHAT models, the choice of positional encoding can facilitate positional comparison (e.g., an angle-based encoding allows for equality testing via dot products) or make it uncomputable (e.g., if positional encodings enumerate Turing machines that halt on their own encodings). It remains to be understood what effect such differences in position representation have on the expressive power of a model.

More generally, is it possible to prove that soft attention, which we have not addressed here, is strictly more powerful than even averaging hard attention? Yao et al. (2021, Theorem B.3) present a construction for a soft attention Transformer that recognizes DYCK-k. This construction crucially uses specialized encodings of position and layer normalization, whose formal power remains to be understood.

Finally, given the success that Transformers have had as models of natural language, it is perhaps surprising that these models' expressive power seems to be best characterized (or at least bounded) in terms of circuit complexity. Mathematical explorations of natural language have most commonly used the approach to language complexity afforded by the Chomsky hierarchy and its refinements, which is based on automata and formal grammars. The apparent incomparability of these approaches suggests that the exploration of different types of Transformer models might offer a new approach to the study of the formal properties of natural language.

Acknowledgments

We thank the reviewers and the action editor for their work in reviewing this paper.

References

Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach. Cambridge University Press, Cambridge, United Kingdom. https://doi.org/10.1017/CBO9780511804090

Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020. On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7096–7116, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.576

Merrick Furst, James B. Saxe, and Michael Sipser. 1984. Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory, 17(1):13–27. https://doi.org/10.1007/BF01744431

Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171. https://doi.org/10.1162/tacl_a_00306

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166

William Merrill, Ashish Sabharwal, and Noah A. Smith. 2022. Saturated transformers are constant-depth threshold circuits. Computing Research Repository, arXiv:2106.16213v3 [cs].

Jorge Pérez, Javier Marinković, and Pablo Barceló. 2019. On the Turing completeness of modern neural network architectures. In ICLR 2019 Conference Track, New Orleans, LA, USA. OpenReview.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, Long Beach, CA, USA. Curran Associates, Inc.

Shunyu Yao, Binghui Peng, Christos Papadimitriou, and Karthik Narasimhan. 2021. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Volume 1: Long Papers, pages 3770–3785, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.292

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 2020. Are Transformers universal approximators of sequence-to-sequence functions? In ICLR 2020 Conference Track, Online. OpenReview.