Sparse, Dense, and Attentional Representations for Text Retrieval

Yi Luan*, Jacob Eisenstein*, Kristina Toutanova*, Michael Collins
{luanyi, jeisenstein, kristout, mjcollins}@google.com
Google Research

*Equal contribution.

Abstract

Dual encoders perform retrieval by encoding documents and queries into dense low-dimensional vectors, scoring each document by its inner product with the query. We investigate the capacity of this architecture relative to sparse bag-of-words models and attentional neural networks. Using both theoretical and empirical analysis, we establish connections between the encoding dimension, the margin between the gold and lower-ranked documents, and the document length, suggesting limitations in the capacity of fixed-length encodings to support precise retrieval of long documents. Building on these insights, we propose a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures, and explore sparse-dense hybrids to capitalize on the precision of sparse retrieval. These models outperform strong alternatives in large-scale retrieval.
1 Introduction

Retrieving relevant documents is a core task for language technology, and is a component of applications such as information extraction and question answering (e.g., Narasimhan et al., 2016; Kwok et al., 2001; Voorhees, 2001). While classical information retrieval has focused on heuristic weights for sparse bag-of-words representations (Spärck Jones, 1972), more recent work has adopted a two-stage retrieval and ranking pipeline, where a large number of documents are retrieved using sparse high-dimensional query/document representations, and are further reranked with learned neural models (Mitra and Craswell, 2018). This two-stage approach has achieved state-of-the-art results on IR benchmarks (Nogueira and Cho, 2019; Yang et al., 2019; Nogueira et al., 2019a), particularly since sizable annotated data has become available for training deep neural models (Dietz et al., 2018; Craswell et al., 2020). However, this pipeline suffers from a strict upper bound imposed by any recall errors in the first-stage retrieval model: for example, the recall@1000 for BM25 reported by Yan et al. (2020) is 69.4.

A promising alternative is to perform first-stage
retrieval using learned dense low-dimensional
encodings of documents and queries (Huang
et al., 2013; Reimers and Gurevych, 2019; Gillick
et al., 2019; Karpukhin et al., 2020). The dual
encoder model scores each document by the
inner product between its encoding and that of
the query. Unlike full attentional architectures,
which require extensive computation on each
candidate document,
the dual encoder can be
easily applied to very large document collections
thanks to efficient algorithms for inner product
search; unlike untrained sparse retrieval models, it
can exploit machine learning to generalize across
related terms.

To assess the relevance of a document to an
information-seeking query, models must both (i)
check for precise term overlap (Per esempio,
presence of key entities in the query) and (ii)
compute semantic similarity generalizing across
related concepts. Sparse retrieval models excel at
the first sub-problem, while learned dual encoders
can be better at the second. Recent history in NLP
might suggest that learned dense representations
should always outperform sparse features overall,
but this is not necessarily true: as shown in
Figura 1, the BM25 model (Robertson et al., 2009)
can outperform a dual encoder based on BERT,
particularly on longer documents and on a task
that requires precise detection of word overlap.1
This raises questions about the limitations of dual

1See §4 for experimental details.


encoders, and the circumstances in which these powerful models do not yet reach the state of the art. Here we explore these questions using both theoretical and empirical tools, and propose a new architecture that leverages the strengths of dual encoders while avoiding some of their weaknesses.

We begin with a theoretical investigation of compressive dual encoders—dense encodings whose dimension is below the vocabulary size—and analyze their ability to preserve distinctions made by sparse bag-of-words retrieval models, which we term their fidelity. Fidelity is important for the sub-problem of detecting precise term overlap, and is a tractable proxy for capacity. Using the theory of dimensionality reduction, we relate fidelity to the normalized margin between the gold retrieval result and its competitors, and show that this margin is in turn related to the length of documents in the collection. We validate the theory with an empirical investigation of the effects of random projection compression on sparse BM25 retrieval using queries and documents from TREC-CAR, a recent IR benchmark (Dietz et al., 2018).

Next, we offer a multi-vector encoding model, which is computationally feasible for retrieval like the dual-encoder architecture and achieves significantly better quality. A simple hybrid that interpolates models based on dense and sparse representations leads to further improvements.

We compare the performance of dual encoders, multi-vector encoders, and their sparse-dense hybrids with classical sparse retrieval models and attentional neural networks, as well as state-of-the-art published results where available. Our evaluations include open retrieval benchmarks (MS MARCO passage and document), and passage retrieval for question answering (Natural Questions). We confirm prior findings that full attentional architectures excel at reranking tasks, but are not efficient enough for large-scale retrieval. Of the more efficient alternatives, the hybridized multi-vector encoder is at or near the top in every evaluation, outperforming state-of-the-art retrieval results in MS MARCO. Our code is publicly available at https://github.com/google-research/language/tree/master/language/multivec.

Figure 1: Recall@1 for retrieving the passage containing a query from three million candidates. The figure compares a fine-tuned BERT-based dual encoder (DE-BERT-768), an off-the-shelf BERT-based encoder with average pooling (BERT-init), and sparse term-based retrieval (BM25), while binning passages by length.

2 Analyzing Dual Encoder Fidelity

A query or a document is a sequence of words drawn from some vocabulary V. Throughout this section we assume a representation of queries and documents typically used in sparse bag-of-words models: each query q and document d is a vector in R^v, where v is the vocabulary size. We take the inner product ⟨q, d⟩ to be the relevance score of document d for query q. This framework accounts for several well-known ranking models, including Boolean inner product, TF-IDF, and BM25. We will compare sparse retrieval models with compressive dual encoders, for which we write f(d) and f(q) to indicate compression of d and q to R^k, with k ≪ v, and where k does not vary with the document length. For these models, the relevance score is the inner product ⟨f(q), f(d)⟩. (In §3, we consider encoders that apply to sequences of tokens rather than vectors of counts.)

A fundamental question is how the capacity of dual encoders varies with the embedding size k. In this section we focus on the related, more tractable notion of fidelity: how much can we compress the input while maintaining the ability to mimic the performance of bag-of-words retrieval? We explore this question mainly through the encoding model of random projections, but also discuss more general dimensionality reduction in §2.2.

2.1 Random Projections

To establish baselines on the fidelity of compressive dual encoder retrieval, we now consider encoders based on random projections (Vempala, 2004). The encoder is defined as f(x) = Ax, where A ∈ R^{k×v} is a random matrix. In Rademacher embeddings, each element a_{i,j}



of the matrix A is sampled with equal probability from two possible values: {−1/√k, +1/√k}. In Gaussian embeddings, each a_{i,j} ∼ N(0, k^{−1/2}). A pairwise ranking error occurs when ⟨q, d1⟩ > ⟨q, d2⟩ but ⟨Aq, Ad1⟩ < ⟨Aq, Ad2⟩. Using such random projections, it is possible to bound the probability of any such pairwise error in terms of the embedding size.

Definition 2.1. For a query q and pair of documents (d1, d2) such that ⟨q, d1⟩ ≥ ⟨q, d2⟩, the normalized margin is defined as

    μ(q, d1, d2) = ⟨q, d1 − d2⟩ / (‖q‖ ‖d1 − d2‖).

Lemma 1. Define a matrix A ∈ R^{k×d} of Gaussian or Rademacher embeddings. Define vectors q, d1, d2 such that μ(q, d1, d2) = ε > 0. A ranking error occurs when ⟨Aq, Ad2⟩ ≥ ⟨Aq, Ad1⟩. If β is the probability of such an error then,

    β ≤ 4 exp(−(k/2)(ε²/2 − ε³/3)).    (1)

The proof, which builds on well-known results about random projections, is found in §A.1. By solving (1) for k, we can derive an embedding size that guarantees a desired upper bound on the pairwise error probability,

    k ≥ 2(ε²/2 − ε³/3)^{−1} ln(4/β).    (2)

It is convenient to derive a simpler but looser quadratic bound (proved in §A.2):

Corollary 1. Define vectors q, d1, d2 such that ε = μ(q, d1, d2) > 0. If A ∈ R^{k×v} is a matrix of random Gaussian or Rademacher embeddings such that k > 12ε^{−2} ln(4/β), then Pr(⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩) ≤ β.

On the Tightness of the Bound. Let k*(q, d1, d2) denote the lowest dimension Gaussian or Rademacher random projection following the definition in Lemma 1, for which Pr(⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩) < β, for a given document pair (d1, d2) and query q with normalized margin ε. Our lemma places an upper bound on k*, saying that k*(q, d1, d2) ≤ 2(ε²/2 − ε³/3)^{−1} ln(4/β). Any k ≥ k*(q, d1, d2) has sufficiently low probability of error, but lower values of k could potentially also have the desired property. Later in this section we perform empirical evaluation to study the tightness of the bound; although theoretical tightness (up to a constant factor) is suggested by results on the optimality of the distributional Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Jayram and Woodruff, 2013; Kane et al., 2011), here we study the question only empirically.

2.1.1 Recall-at-r

In retrieval applications, it is important to return the desired result within the top r search results. For query q, define d1 as the document that maximizes some inner product ranking metric. The probability of returning d1 in the top r results after random projection can be bounded by a function of the embedding size and normalized margin:

Lemma 2. Consider a query q, with target document d1, and document collection D that excludes d1, and such that ∀ d2 ∈ D, μ(q, d1, d2) > 0. Define r0 to be any integer such that 1 ≤ r0 ≤ |D|. Define ε to be the r0'th smallest normalized margin μ(q, d1, d2) for any d2 ∈ D, and for simplicity assume that only a single document d2 ∈ D has μ(q, d1, d2) = ε.² Define a matrix A ∈ R^{k×d} of Gaussian or Rademacher embeddings. Define R to be a random variable such that R = |{d2 ∈ D : ⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩}|, and let C = 4(|D| − r0 + 1). Then

    Pr(R ≥ r0) ≤ C exp(−(k/2)(ε²/2 − ε³/3)).

The proof is in §A.3. A direct consequence of the lemma is that to achieve recall-at-r0 = 1 for a given (q, d1, D) triple with probability ≥ 1 − β, it is sufficient to set

    k ≥ 2(ε²/2 − ε³/3)^{−1} ln(4(|D| − r0 + 1)/β),    (3)

where ε is the r0'th smallest normalized margin. As with the bound on pairwise relevance errors in Lemma 1, Lemma 2 implies an upper bound on the minimum random projection dimension k*(q, d1, D) that recalls d1 in the top r0 results with probability ≥ 1 − β. Due to the application of the union bound and worst-case assumptions about the normalized margins of documents in D \ D_ε, this bound is potentially loose. Later in this section we examine the empirical relationship between maximum document length, the distribution of normalized margins, and k*.

²The case where multiple documents are tied with normalized margin ε is straightforward but slightly complicates the analysis.

2.1.2 Application to Boolean Inner Product

Boolean inner product is a retrieval function in which d, q ∈ {0, 1}^v over a vocabulary of size v, with d_i indicating the presence of term i in the document (and analogously for q_i). The relevance score ⟨q, d⟩ is then the number of terms that appear in both q and d. For this simple retrieval function, it is possible to compute an embedding size that guarantees a desired pairwise error probability over an entire dataset of documents.

Corollary 2. For a set of documents D ⊆ {0, 1}^v and a query q ∈ {0, 1}^v, let L_D = max_{d ∈ D} ‖d‖² and L_Q = ‖q‖². Let A ∈ R^{k×v} be a matrix of random Rademacher or Gaussian embeddings such that k ≥ 24 L_Q L_D ln(4/β). Then for any d1, d2 ∈ D such that ⟨q, d1⟩ > ⟨q, d2⟩, the probability that ⟨Aq, Ad1⟩ ≤ ⟨Aq, Ad2⟩ is ≤ β.

The proof is in §A.4. The corollary shows that for Boolean inner product ranking, we can guarantee any desired error bound β by choosing an embedding size k that grows linearly in L_D, the number of unique terms in the longest document.

2.1.3 Application to TF-IDF and BM25

Both TF-IDF (Spärck Jones, 1972) and BM25 (Robertson et al., 2009) can be written as inner products between bag-of-words representations of the document and query as described earlier in this section. Set the query representation q̃_i = q_i × IDF_i, where q_i indicates the presence of the term in the query and IDF_i indicates the inverse document frequency of term i. The TF-IDF score is then ⟨q̃, d⟩. For BM25, we define d̃ ∈ R^v, with each d̃_i a function of the count d_i and the document length (and hyperparameters); BM25(q, d) is then ⟨q̃, d̃⟩. Due to its practical utility in retrieval, we now focus on BM25.

Pairwise Accuracy. We use empirical data to test the applicability of Lemma 1 to the BM25 relevance model. We select query-document triples (q, d1, d2) from the TREC-CAR dataset (Dietz et al., 2018) by considering all possible (q, d2), and selecting d1 = argmax_d BM25(q, d). We bin the triples by the normalized margin ε, and compute the quantity (ε²/2 − ε³/3)^{−1}. According to Lemma 1, the minimum embedding size of a random projection k* which has ≤ β probability of making an error on a triple with normalized margin ε is upper bounded by a linear function of this quantity. In particular, for β = .05, the Lemma entails that k* ≤ 8.76 (ε²/2 − ε³/3)^{−1}. In this experiment we measure the empirical value of k* to evaluate the tightness of the bound.

Figure 2: Minimum k sufficient for Rademacher embeddings to approximate BM25 pairwise rankings on TREC-CAR with error rate β < .05.

The results are shown on the x-axis of Figure 2. For each bin we compute the minimum embedding size required to obtain 95% pairwise accuracy in ranking d1 vs d2, using a grid of 40 possible values for k between 32 and 9472, shown on the y-axis. (We exclude examples that had higher values of (ε²/2 − ε³/3)^{−1} than the range shown because they did not reach 95% accuracy for the explored range of k.) The figure shows that the theoretical bound is tight up to a constant factor, and that the minimum embedding size that yields the desired fidelity grows linearly with (ε²/2 − ε³/3)^{−1}.

Margins and Document Length. For Boolean inner product, it was possible to express the minimum possible normalized margin (and therefore a sufficient embedding size) in terms of L_Q and L_D, the maximum number of unique terms across all queries and documents, respectively. Unfortunately, it is difficult to analytically derive a minimum normalized margin ε for either TF-IDF or BM25: because each term may have a unique inverse document frequency, the minimum non-zero margin ⟨q, d1 − d2⟩ decreases with the number of terms in the query, as each additional term creates more ways in which two documents can receive nearly the same score. We therefore study empirically how normalized margins vary with maximum document length. Using the TREC-CAR retrieval dataset, we bin documents by length. For each query, we compute the normalized margins between the document with best BM25 in the bin and all other documents in the bin, and look at the 10th, 100th, and 1000th smallest normalized margins.

Figure 3: Random projection on BM25 retrieval in the TREC-CAR dataset, with documents binned by length.

The distribution over these normalized margins is shown in Figure 3a, revealing that normalized margins decrease with document length. In practice, the observed minimum normalized margin for a collection of documents and queries is found to be much lower for BM25 compared to Boolean inner product. For example, for the collection used in Figure 2, the minimum normalized margin for BM25 is 6.8e-06, while for Boolean inner product it is 0.0169.

Document Length and Encoding Dimension. Figure 3b shows the growth in the minimum random projection dimension required to reach the desired recall-at-10, using the same document bins as in Figure 3a. As predicted, the required dimension increases with the document length, while the normalized margin shrinks.

2.2 Bounds on General Encoding Functions

We derived upper bounds on the minimum required encoding for random linear projections above, and found the bounds on (q, d1, d2) triples to be empirically tight up to a constant factor.
More general non-linear and learned encoders could be more efficient. However, there are general theoretical results showing that it is impossible for any encoder to guarantee an inner product distortion |⟨x, y⟩ − ⟨f(x), f(y)⟩| ≤ ε with an encoding that does not grow as Ω(ε^{−2}) (Larsen and Nelson, 2017; Alon and Klartag, 2017), for vectors x, y with norm ≤ 1. These results suggest more general capacity limitations for fixed-length dual encoders when document length grows.

In our setting, BM25, TF-IDF, and Boolean inner product can all be reformulated equivalently as inner products in a space with vectors of norm at most 1 by L2-normalizing each query vector and rescaling all document vectors by √L_D = max_d ‖d‖, a constant factor that grows with the length of the longest document. Now suppose we desire to limit the distortion on the unnormalized inner products to some value ε̃, which might guarantee a desired performance characteristic. This corresponds to decreasing the maximum normalized inner product distortion ε by a factor of √L_D. According to the general bounds on dimensionality reduction mentioned in the previous paragraph, this could necessitate an increase in the encoding size by a factor of L_D.

However, there are a number of caveats to this theoretical argument. First, the theory states only that there exist vector sets that cannot be encoded into representations that grow more slowly than Ω(ε^{−2}); actual documents and queries might be easier to encode if, for example, they are generated from some simple underlying stochastic process. Second, our construction achieves ‖d‖ ≤ 1 by rescaling all document vectors by a constant factor, but there may be other ways to constrain the norms while using the embedding space more efficiently. Third, in the non-linear case it might be possible to eliminate ranking errors without achieving low inner product distortion. Finally, from a practical perspective, the generalization offered by learned dual encoders might overwhelm any sacrifices in fidelity, when evaluated on real tasks of interest. Lacking theoretical tools to settle these questions, we present a set of empirical investigations in later sections of this paper. But first we explore a lightweight modification to the dual encoder, which offers gains in expressivity at limited additional computational cost.

3 Multi-Vector Encodings

The theoretical analysis suggests that fixed-length vector representations of documents may in general need to be large for long documents, if fidelity with respect to sparse high-dimensional representations is important. Cross-attentional architectures can achieve higher fidelity, but are impractical for large-scale retrieval (Nogueira et al., 2019b; Reimers and Gurevych, 2019; Humeau et al., 2020). We therefore propose a new architecture that represents each document as a fixed-size set of m vectors. Relevance scores are computed as the maximum inner product over this set.

Formally, let x = (x1, . . . , xT) represent a sequence of tokens, with x1 equal to the special token [CLS], and define y analogously. Then [h1(x), . . . , hT(x)] represents the sequence of contextualized embeddings at the top level of a deep transformer. We define a single-vector representation of the query x as f(1)(x) = h1(x), and a multi-vector representation of the document y as f(m)(y) = [h1(y), . . . , hm(y)], the first m representation vectors for the sequence of tokens in y, with m < T. The relevance score is defined as ψ(m)(x, y) = max_{j=1...m} ⟨f(1)(x), f(m)_j(y)⟩.

Although this scoring function is not a dual encoder, the search for the highest-scoring document can be implemented efficiently with standard approximate nearest-neighbor search by adding multiple (m) entries for each document to the search index data structure. If some vector f(m)_j(y) yields the largest inner product with the query vector f(1)(x), it is easy to show the corresponding document must be the one that maximizes the relevance score ψ(m)(x, y). The size of the index must grow by a factor of m, but due to the efficiency of contemporary approximate nearest neighbor and maximum inner product search, the time complexity can be sublinear in the size of the index (Andoni et al., 2019; Guo et al., 2016b). Thus, a model using m vectors of size k to represent documents is more efficient at run-time than a dual encoder that uses a single vector of size mk.

This efficiency is a key difference from the POLY-ENCODER (Humeau et al., 2020), which computes a fixed number of vectors per query, and aggregates them by softmax attention against document vectors. Yang et al. (2018b) propose a similar architecture for language modeling. Because of the use of softmax in these approaches, it is not possible to decompose the relevance score into a max over inner products, and so fast nearest-neighbor search cannot be applied. In addition, these works did not address retrieval from a large document collection.

Analysis. To see why multi-vector encodings can enable smaller encodings per vector, consider an idealized setting in which each document vector is the sum of m orthogonal segments such that d = Σ_{i=1}^{m} d(i), and each query refers to exactly one segment in the gold document.³ An orthogonal segmentation can be obtained by choosing the segments as a partition of the vocabulary.

³Here we use (d, q) rather than (x, y) because we describe vector encodings rather than token sequences.
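To make the indexing argument above concrete, here is a minimal sketch (ours, with toy random vectors, not the paper's implementation): each document contributes m entries to one flat index, a single maximum-inner-product search is run per query, and the document of the best-scoring entry is exactly the argmax of ψ(m).

    import numpy as np

    def build_flat_index(doc_vectors):
        """doc_vectors: list of (m_i, k) arrays, one per document.
        Returns a (sum_i m_i, k) matrix and a parallel array of document ids."""
        index = np.concatenate(doc_vectors, axis=0)
        doc_ids = np.concatenate(
            [np.full(v.shape[0], i) for i, v in enumerate(doc_vectors)])
        return index, doc_ids

    def multi_vector_search(query_vec, index, doc_ids):
        """One MIPS over the flattened index; the best entry's document
        maximizes psi^(m)(x, y) = max_j <f^(1)(x), f_j^(m)(y)>."""
        scores = index @ query_vec
        return int(doc_ids[np.argmax(scores)])

    rng = np.random.default_rng(0)
    k, m = 32, 4
    docs = [rng.normal(size=(m, k)) for _ in range(1000)]
    q = rng.normal(size=k)

    index, doc_ids = build_flat_index(docs)
    best = multi_vector_search(q, index, doc_ids)

    # Brute-force check against the per-document max-inner-product score.
    brute = int(np.argmax([np.max(d @ q) for d in docs]))
    assert best == brute

In practice the flat matrix would be handed to an approximate MIPS library rather than scored exhaustively, which is where the sublinear time complexity comes from.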

Theorem 1. Define vectors q, d1, d2 ∈ R^v such that ⟨q, d1⟩ ≥ ⟨q, d2⟩, and assume that both d1 and d2 can be decomposed into m segments such that d1 = Σ_{i=1}^{m} d1(i), and analogously for d2; all segments across both documents are orthogonal. If there exists an i such that ⟨q, d1(i)⟩ = ⟨q, d1⟩ and ⟨q, d2⟩ ≥ ⟨q, d2(i)⟩, then μ(q, d1(i), d2(i)) ≥ μ(q, d1, d2). (The proof is in §A.5.)

Remark 1. The BM25 score can be computed from non-negative representations of the document and query; if the segmentation corresponds to a partition of the vocabulary, then the segments will also be non-negative, and thus the condition ⟨q, d2⟩ ≥ ⟨q, d2(i)⟩ holds for all i.

The relevant case is when the same segment i is maximal for both documents, ⟨q, d2(i)⟩ = max_j ⟨q, d2(j)⟩, as will hold for ''simple'' queries that are well-aligned with the segmentation. Then the normalized margin in the multi-vector model will be at least as large as in the equivalent single vector representation. The relationship to encoding size follows from the theory in the previous section: Theorem 1 implies that if we set f(m)_i(y) = Ad(i) (for appropriate A), then


an increase in the normalized margin enables the
use of a smaller encoding dimension k while
still supporting the same pairwise error rate.
There are now m times more ‘‘documents’’ to
evaluate, but Lemma 2 shows that this exerts only
a logarithmic increase on the encoding size for a
desired recall@r. But while we hope this argument
is illuminating,
the assumptions of orthogonal
segments and perfect segment match against the
query are quite strong. We must therefore rely
on empirical analysis to validate the efficacy of
multi-vector encoding in realistic applications.
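As a concrete, self-contained check of the orthogonal-segment intuition (our own toy example, with arbitrary vectors chosen to satisfy the stated assumptions): when the vocabulary is partitioned into segments and the query's mass falls in a single segment of the gold document, the segment-level normalized margin is at least the whole-document margin.

    import numpy as np

    def mu(q, d1, d2):
        diff = d1 - d2
        return q @ diff / (np.linalg.norm(q) * np.linalg.norm(diff))

    # Vocabulary of 6 terms partitioned into m = 2 segments: {0,1,2} and {3,4,5}.
    q  = np.array([1., 1., 0., 0., 0., 0.])   # query touches only segment 0
    d1 = np.array([2., 1., 1., 0., 3., 4.])   # gold document
    d2 = np.array([1., 0., 1., 2., 5., 1.])   # competitor

    seg = lambda d, i: np.where(np.arange(6) // 3 == i, d, 0.)  # segment i of d

    # Segment 0 carries all of <q, d1>, and the representations are non-negative,
    # so the conditions of Theorem 1 hold.
    assert np.isclose(q @ seg(d1, 0), q @ d1)
    print("document margin:", mu(q, d1, d2))
    print("segment margin: ", mu(q, seg(d1, 0), seg(d2, 0)))
    assert mu(q, seg(d1, 0), seg(d2, 0)) >= mu(q, d1, d2) - 1e-12

Here the segment margin is 1.0 versus roughly 0.32 for the full documents, illustrating why per-vector encodings can be smaller when queries align with segments.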

Cross-Attention. Cross-attentional
architec-
tures can be viewed as a generalization of the
multi-vector model: (1) set m = Tmax (one vector
per token); (2) compute one vector per token in
the query; (3) allow more expressive aggregation
over vectors than the simple max employed above.
Any sparse scoring function (e.g., BM25) can be
mimicked by a cross-attention model, which need
only compute identity between individual words;
this can be achieved by random projection word
embeddings whose dimension is proportional to
the log of the vocabulary size. By definition, the
required representation also grows linearly with
the number of tokens in the passage and query.
As with the POLY-ENCODER, retrieval in the cross-
attention model cannot be performed efficiently at
scale using fast nearest-neighbor search. In contemporaneous work, Khattab and Zaharia (2020)
propose an approach with TY vectors per query
and TX vectors per document, using a simple
sum-of-max for aggregation of the inner products.
They apply this approach to retrieval via re-
ranking results of TY nearest-neighbor searches.
Our multi-vector model uses fixed length repre-
sentations instead, and a single nearest neighbor
search per query.
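For concreteness, the sketch below (ours, not taken from any of the cited codebases) contrasts the three aggregation schemes discussed here on toy vectors: the single-vector dual encoder inner product, our multi-vector max, and a ColBERT-style sum-of-max over per-token vectors.

    import numpy as np

    def dual_encoder_score(q_vec, d_vec):
        """One vector per query and per document: a plain inner product."""
        return float(q_vec @ d_vec)

    def multi_vector_score(q_vec, d_vecs):
        """One query vector, m document vectors, max inner product (our model)."""
        return float(np.max(d_vecs @ q_vec))

    def sum_of_max_score(q_vecs, d_vecs):
        """ColBERT-style aggregation (Khattab and Zaharia, 2020): one vector per
        query token, one per document token, sum over query tokens of the best match."""
        return float(np.sum(np.max(q_vecs @ d_vecs.T, axis=1)))

    rng = np.random.default_rng(0)
    k = 16
    q_vec, q_vecs = rng.normal(size=k), rng.normal(size=(5, k))   # query: 1 vs 5 vectors
    d_vec, d_vecs = rng.normal(size=k), rng.normal(size=(8, k))   # document: 1 vs 8 vectors

    print(dual_encoder_score(q_vec, d_vec))
    print(multi_vector_score(q_vec, d_vecs))
    print(sum_of_max_score(q_vecs, d_vecs))

Only the first two decompose into a single inner product per index entry, which is why they admit one nearest-neighbor search per query.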

4 Experimental Setup

The full IR task requires detection of both precise
word overlap and semantic generalization. Our
theoretical results focus on the first aspect, E
derive theoretical and empirical bounds on the
sufficient dimensionality to achieve high fidelity
with respect to sparse bag-of-words models as
document length grows, for two types of linear
random projections. The theoretical setup differs
from modeling for realistic information-seeking
scenarios in at least two ways.


First, trained non-linear dual encoders might be able to detect precise word overlap with much lower-dimensional encodings, especially for queries and documents with a natural distribution, which may exhibit a low-dimensional subspace structure. Second, the semantic generalization aspect of the IR task may be more important than the first aspect for practical applications, and our theory does not make predictions about how encoder dimensionality relates to such ability to compute general semantic similarity.

We relate the theoretical analysis to text retrieval in practice through experimental studies on three tasks. The first task, described in §5, tests the ability of models to retrieve natural language documents that exactly contain a query, and evaluates both BM25 and deep neural dual encoders on a task of detecting precise word overlap, defined over texts with a natural distribution. The second task, described in §6, is the passage retrieval sub-problem of the open-domain QA version of the Natural Questions (Kwiatkowski et al., 2019; Lee et al., 2019); this benchmark reflects the need to capture graded notions of similarity and has a natural query text distribution. For both of these tasks, we perform controlled experiments varying the maximum length of the documents in the collection, which enables assessing the relationship between encoder dimension and document length.

To evaluate the performance of our best
models in comparison to state-of-the-art works
on large-scale retrieval and ranking, in §7 we
report results on a third group of tasks focusing
on passage/document ranking: the passage and
document-level MS MARCO retrieval datasets
(Nguyen et al., 2016; Craswell et al., 2020). Here
we follow the standard two-stage retrieval and
ranking system: a first-stage retrieval from a large
document collection, followed by reranking with
a cross-attention model. We focus on the impact
of the first-stage retrieval model.

4.1 Models

Our experiments compare compressive and sparse
dual encoders, cross attention, and hybrid models.

BM25. We use case-insensitive wordpiece tok-
enizations of texts and default BM25 parameters
from the gensim library. We apply either unigram
(BM25-uni) or combined unigram+bigram repre-
sentations (BM25-bi).
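To connect this baseline back to the inner-product framing of §2.1.3, here is a minimal BM25 written explicitly as a sparse inner product ⟨q̃, d̃⟩ over term counts. This is our own sketch with the common Okapi k1/b parameterization, not the gensim configuration used in the experiments.

    import math
    from collections import Counter

    def bm25_vectors(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
        """Return sparse q~ and d~ (dicts) whose inner product is the BM25 score."""
        counts = Counter(doc_terms)
        dlen = len(doc_terms)
        q_tilde = {t: math.log((num_docs - doc_freq.get(t, 0) + 0.5) /
                               (doc_freq.get(t, 0) + 0.5) + 1.0)       # IDF weight per query term
                   for t in set(query_terms)}
        d_tilde = {t: c * (k1 + 1) / (c + k1 * (1 - b + b * dlen / avg_len))
                   for t, c in counts.items()}                          # length-saturated term frequency
        return q_tilde, d_tilde

    def sparse_dot(u, v):
        return sum(w * v[t] for t, w in u.items() if t in v)

    # Tiny two-document "collection".
    docs = [["deep", "retrieval", "models"],
            ["sparse", "retrieval", "with", "bm25", "weights"]]
    df = Counter(t for d in docs for t in set(d))
    avg_len = sum(len(d) for d in docs) / len(docs)
    query = ["sparse", "retrieval"]

    for d in docs:
        q_t, d_t = bm25_vectors(query, d, df, len(docs), avg_len)
        print(d, "score:", round(sparse_dot(q_t, d_t), 3))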


Dual Encoders from BERT (DE-BERT). We encode queries and documents using BERT-base, which is a pre-trained transformer network (12 layers, 768 dimensions) (Devlin et al., 2019). We implement dual encoders from BERT as a special case of the multi-vector model formalized in §3, with the number of vectors for the document m = 1: the representations for queries and documents are the top layer representations at the [CLS] token. This approach is widely used for retrieval (Lee et al., 2019; Reimers and Gurevych, 2019; Humeau et al., 2020; Xiong et al., 2020).⁴ For lower-dimensional encodings, we learn down-projections from d = 768 to k ∈ {32, 64, 128, 512},⁵ implemented as a single feed-forward layer, followed by layer normalization. All parameters are fine-tuned for the retrieval tasks. We refer to these models as DE-BERT-k.
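A minimal sketch of a DE-BERT-k encoder as described above, assuming the HuggingFace transformers and PyTorch packages; the bert-base-uncased checkpoint name and other details beyond what the text states are illustrative assumptions rather than the paper's exact setup.

    import torch
    from torch import nn
    from transformers import AutoModel, AutoTokenizer

    class DeBert(nn.Module):
        """[CLS] encoding, a down-projection to k dimensions, then layer norm."""
        def __init__(self, k=128, name="bert-base-uncased"):
            super().__init__()
            self.bert = AutoModel.from_pretrained(name)
            self.proj = nn.Linear(self.bert.config.hidden_size, k)
            self.norm = nn.LayerNorm(k)

        def forward(self, **enc):
            cls = self.bert(**enc).last_hidden_state[:, 0]   # top-layer [CLS] vector
            return self.norm(self.proj(cls))

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = DeBert(k=128)
    q = tok(["who wrote hamlet"], return_tensors="pt")
    d = tok(["Hamlet is a tragedy written by William Shakespeare."], return_tensors="pt")
    with torch.no_grad():
        score = (model(**q) * model(**d)).sum(-1)            # inner-product relevance score
    print(score)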

Cross-Attentional BERT. The most expressive
model we consider is cross-attentional BERT, which
we implement by applying the BERT encoder to
the concatenation of the query and document,
with a special [SEP] separator between x and y.
The relevance score is a learned linear function
of the encoding of the [CLS] token. Due to
the computational cost, cross-attentional BERT
is applied only in reranking as in prior work
(Nogueira and Cho, 2019; Yang et al., 2019).
These models are referred to as CROSS-ATTENTION.

Multi-Vector Encoding from BERT (ME-BERT). In §3 we introduced a model in which every document is represented by exactly m vectors. We use m = 8 as a good compromise between cost and accuracy in §5 and §6, and find values of 3 to 4 for m more accurate on the datasets in §7. In addition to using BERT output representations directly, we consider down-projected representations, implemented using a feed-forward layer with dimension 768 × k. A model with k-dimensional embeddings is referred to as ME-BERT-k.
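The document side of ME-BERT-k can be sketched as the first m top-layer vectors passed through a shared down-projection, scored by the max inner product against the single query vector. The encoder itself is omitted below; the random tensor stands in for BERT's top-layer output, and all shapes are illustrative assumptions.

    import torch
    from torch import nn

    class MultiVectorHead(nn.Module):
        """Projects the first m top-layer vectors of a document encoder to k dims."""
        def __init__(self, hidden_size=768, k=64, m=8):
            super().__init__()
            self.m = m
            self.proj = nn.Linear(hidden_size, k)
            self.norm = nn.LayerNorm(k)

        def doc_vectors(self, hidden):
            return self.norm(self.proj(hidden[:, :self.m]))      # (batch, m, k)

        def score(self, query_vec, doc_vecs):
            # psi^(m)(x, y) = max_j <f^(1)(x), f_j^(m)(y)>
            return torch.einsum("bk,bmk->bm", query_vec, doc_vecs).max(dim=-1).values

    head = MultiVectorHead()
    hidden = torch.randn(2, 128, 768)        # stand-in for BERT top-layer outputs
    query_vec = torch.randn(2, 64)
    print(head.score(query_vec, head.doc_vectors(hidden)))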

Sparse-Dense Hybrids (HYBRID). A natural approach to balancing between the fidelity of sparse representations and the generalization of learned dense ones is to build a hybrid. To do this, we linearly combine a sparse and dense system's scores using a single trainable weight λ, tuned on a development set. For example, a hybrid model of ME-BERT and BM25-uni is referred to as HYBRID-ME-BERT-uni. We implement approximate search to retrieve using a linear combination of two systems by re-ranking the n-best top scoring candidates from each system. Prior and concurrent work has also used hybrid sparse-dense models (Guo et al., 2016a; Seo et al., 2019; Karpukhin et al., 2020; Ma et al., 2020; Gao et al., 2020). Our contribution is to assess the impact of sparse-dense hybrids as the document length grows.

4.2 Learning and Inference

For the experiments in §5 and §6, all trained models are initialized from BERT-base, and all parameters are fine-tuned using a cross-entropy loss with 7 sampled negatives from a pre-computed 200-document list and additional in-batch negatives (with a total number of 1024 candidates in a batch); the pre-computed candidates include 100 top neighbors from BM25 and 100 random samples. This is similar to the method by Lee et al. (2019), but with additional fixed candidates, also used in concurrent work (Karpukhin et al., 2020). Given a model trained in this way, for the scalable methods, we also applied hard-negative mining as in Gillick et al. (2019) and used one iteration when beneficial. More sophisticated negative selection is proposed in concurrent work (Xiong et al., 2020). For retrieval from large document collections with the scalable models, we used ScaNN, an efficient approximate nearest neighbor search library (Guo et al., 2020); in most experiments, we use exact search settings but also evaluate approximate search in Section 7. In §7, the same general approach with slightly different hyperparameters (detailed in that section) was used, to enable more direct comparisons to prior work.

⁴Based on preliminary experiments with pooling strategies, we use the [CLS] vectors (without the feed-forward layer projection learned on the next sentence prediction task).

⁵We experimented with adding a similar layer for d = 768, but this did not offer empirical gains.
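The training objective of §4.2 can be written schematically as follows, assuming PyTorch: a cross-entropy loss over the gold candidate, a handful of sampled negatives, and the other gold documents in the batch as in-batch negatives. The batch size, dimensions, and negative counts below are placeholders, not the exact configuration.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q_emb, pos_emb, neg_emb):
        """q_emb: (B, k) query encodings; pos_emb: (B, k) gold documents;
        neg_emb: (B, n, k) sampled negatives (e.g., BM25 neighbors + random).
        Each query is scored against its gold, its own sampled negatives, and
        all other gold documents in the batch (in-batch negatives)."""
        pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)            # (B, 1)
        neg_scores = torch.einsum("bk,bnk->bn", q_emb, neg_emb)         # (B, n)
        in_batch = q_emb @ pos_emb.T                                    # (B, B)
        mask = torch.eye(q_emb.size(0), dtype=torch.bool)
        in_batch = in_batch.masked_fill(mask, float("-inf"))            # drop each query's own gold
        logits = torch.cat([pos_scores, neg_scores, in_batch], dim=1)
        labels = torch.zeros(q_emb.size(0), dtype=torch.long)           # gold sits in column 0
        return F.cross_entropy(logits, labels)

    B, n, k = 8, 7, 128
    loss = contrastive_loss(torch.randn(B, k), torch.randn(B, k), torch.randn(B, n, k))
    print(float(loss))

In training, the three embedding tensors would come from the query and document encoders so that gradients flow into all encoder parameters.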

5 Containing Passage ICT Task

We begin with experiments on the task of retriev-
ing a Wikipedia passage y containing a sequence
of words x. We create a dataset using Wikipedia,
following the Inverse Cloze Task definition by
Lee et al. (2019), but adapted to suit the goals of
our study. The task is defined by first breaking
Wikipedia texts into segments of length at most l.


Figure 4: Results on the containing passage ICT task as maximum passage length varies (50 to 400 tokens). Left: Reranking 200 candidates; Right: Retrieval from 3 million candidates. Exact numbers refer to Table A.1.

These form the document collection D. Queries x_i are generated by sampling sub-sequences from the documents y_i. We use queries of lengths between 5 and 25, and do not remove the queries x_i from their corresponding documents y_i.

We create a dataset with 1 million queries and evaluate retrieval against four document collections D_l, for l ∈ {50, 100, 200, 400}. Each D_l contains 3 million documents of maximum length l tokens. In addition to original Wikipedia passages, each D_l contains synthetic distractor documents, which contain the large majority of words in x but differ by one or two tokens. 5K queries are used for evaluation, leaving the rest for training and validation. Although checking containment is a straightforward machine learning task, it is a good testbed for assessing the fidelity of compressive neural models. BM25-bi achieves over 95 MRR@10 across collections for this task.
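A sketch of the query construction for this task, under our reading of the description above: passages of at most l tokens are paired with sampled sub-sequences of 5 to 25 tokens, and the query is not removed from its passage. Tokenization and sampling details are simplified.

    import random

    def make_ict_examples(tokens, max_len=200, q_min=5, q_max=25, seed=0):
        """Split a token stream into passages of at most max_len tokens and pair
        each passage with a query sampled from its own sub-sequences."""
        rng = random.Random(seed)
        passages = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
        examples = []
        for passage in passages:
            if len(passage) < q_min:
                continue
            q_len = rng.randint(q_min, min(q_max, len(passage)))
            start = rng.randint(0, len(passage) - q_len)
            examples.append((passage[start:start + q_len], passage))
        return examples

    toy = ("the quick brown fox jumps over the lazy dog " * 60).split()
    for query, passage in make_ict_examples(toy)[:2]:
        print(len(query), "-token query from a", len(passage), "-token passage")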

Figure 4 (left) shows test set results on reranking, where models need to select one of 200 passages (top 100 BM25-bi and 100 random candidates). It is interesting to see how strong the sparse models are relative to even a 768-dimensional DE-BERT. As the document length increases, the performance of both the sparse and dense dual encoders worsens; the accuracy of the DE-BERT models falls most rapidly, widening the gap to BM25.

Full cross-attention is nearly perfect and does not degrade with document length. ME-BERT-768, which uses 8 vectors of dimension 768 to represent documents, strongly outperforms the best DE-BERT model. Even ME-BERT-64, which uses 8 vectors of size only 64 instead (thus requiring the same document collection size as DE-BERT-512 and being faster at inference time), outperforms the DE-BERT models by a large margin.


Figure 4 (right) shows results for the much more
challenging task of retrieval from 3 million can-
didates. For the latter setting, we only evaluate
models that can efficiently retrieve nearest neigh-
bors from such a large set. We see similar behavior
to the reranking setting, with the multi-vector
methods exceeding BM25-uni performance for all
lengths and DE-BERT models under-performing
BM25-uni. The hybrid model outperforms both
components in the combination with largest
improvements over ME-BERT for
the longest-
document collection.

6 Retrieval for Open-Domain QA


For this task we similarly use English Wikipedia⁶ as four different document collections, of maximum passage length l ∈ {50, 100, 200, 400}, and corresponding approximate sizes of 39 million, 27.3 million, 16.1 million, and 10.2 million documents, respectively. Here we use real user queries contained in the Natural Questions dataset (Kwiatkowski et al., 2019). We follow the setup in Lee et al. (2019). There are 87,925 QA pairs in training and 3,610 QA pairs in the test set. We hold out a subset of training for development.

For document retrieval, a passage is correct for a query x if it contains a string that matches exactly an annotator-provided short answer for the question. We form a reranking task by considering the top 100 results from BM25-uni and 100 random samples, and also consider the full retrieval setting. BM25-uni is used here instead of BM25-bi, because it is the stronger model for this task.

Our theoretical results do not make direct predictions for the performance of compressive dual encoder models relative to BM25 on this task.

⁶https://archive.org/download/enwiki-20181220.

Figure 5: Results on NQ passage recall as maximum passage length varies (50 to 400 tokens). Left: Reranking of 200 passages; Right: Open-domain retrieval results on all of (English) Wikipedia. Exact numbers refer to Table A.1.

They do tell us that as the document length grows, low-dimensional compressive dual encoders may not be able to measure weighted term overlap precisely, potentially leading to lower performance on the task. Therefore, we would expect that higher dimensional dual encoders, multi-vector encoders, and hybrid models become more useful for collections with longer documents.

Figure 5 (left) shows heldout set results on the reranking task. To fairly compare systems that operate over collections of different-sized passages, we allow each model to select approximately the same number of tokens (400) and evaluate on whether an answer is contained in them. For example, models retrieving from D50 return their top 8 passages, and ones retrieving from D100 retrieve the top 4. The figure shows this recall@400 tokens across models. The relative performance of BM25-uni and DE-BERT is different from that seen in the ICT task, due to the semantic generalizations needed. Nevertheless, higher-dimensional DE-BERT models generally perform better, and multi-vector models provide further benefits, especially for longer-document collections; ME-BERT-768 outperforms DE-BERT-768 and ME-BERT-64 outperforms DE-BERT-512; CROSS-ATTENTION is still substantially stronger.

Figure 5 (right) shows heldout set results for the task of retrieving from Wikipedia for each of the four document collections D_l. Unlike the reranking setting, only higher-dimensional DE-BERT models outperform BM25 for passages longer than 50. The hybrid models offer large improvements over their components, capturing both precise word overlap and semantic similarity. The gain from adding BM25 to ME-BERT and DE-BERT increases as the length of the documents in the collection grows, which is consistent with our expectations based on the theory.


7 Large-Scale Supervised IR

The previous experimental sections focused on
understanding the relationship between compres-
sive encoder representation dimensionality and
document
length. Here we evaluate whether
our newly proposed multi-vector retrieval model
ME-BERT, its corresponding dual encoder base-
line DE-BERT, and sparse-dense hybrids compare
favorably to state-of-the-art models for large-
scale supervised retrieval and ranking on IR
benchmarks.

Datasets. The MS MARCO passage ranking task focuses on ranking passages from a collection of about 8.8 million passages. About 532k queries paired with relevant passages are provided for training. The MS MARCO document ranking task is on ranking full documents instead. The full collection contains about 3 million documents and the training set has about 367 thousand queries. We report results on the passage and document development sets, comprising 6,980 and 5,193 queries, respectively, in Table 1. We report MS MARCO and TREC DL 2019 (Craswell et al., 2020) test results in Table 2.

Model Settings. For MS MARCO passage we apply models on the provided passage collections. For MS MARCO document, we follow Yan et al. (2020) and break documents into a set of overlapping passages with length up to 482 tokens, each including the document URL and title. For each task, we train the models on that task's training data only. We initialize the retriever and reranker models with BERT-large. We train dense retrieval models on positive and negative candidates from the 1000-best list of BM25, additionally using one iteration of hard negative mining when beneficial. For ME-BERT, we used m = 3 for the passage and m = 4 for the document task.
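A sketch of the document-to-passage splitting just described; the paper only specifies the maximum passage length and the URL/title prefix, so the 50% stride, the example URL, and the whitespace tokenization here are illustrative assumptions.

    def split_document(url, title, body_tokens, max_len=482, stride=241):
        """Break a document into overlapping passages of up to max_len tokens,
        each prefixed with the document URL and title (stride is assumed)."""
        header = f"{url} {title}".split()
        budget = max_len - len(header)
        passages = []
        for start in range(0, max(len(body_tokens), 1), stride):
            passages.append(header + body_tokens[start:start + budget])
            if start + budget >= len(body_tokens):
                break
        return passages

    doc = ["token"] * 1200
    parts = split_document("http://example.com", "Example title", doc)
    print(len(parts), [len(p) for p in parts])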


Model            MS-Passage MRR@10    MS-Doc MRR@10

Retrieval
BM25             0.167                 0.249
BM25-E           0.184                 0.209
DOC2QUERY        0.215                 -
DOCT5QUERY       0.278                 -
DEEPCT           0.243                 -
HDCT             -                     0.300
DE-BERT          0.302                 0.288
ME-BERT          0.334                 0.333
DE-HYBRID        0.304                 0.313
DE-HYBRID-E      0.309                 0.315
ME-HYBRID        0.338                 0.346
ME-HYBRID-E      0.343                 0.339

Reranking
MULTI-STAGE      0.390                 -
IDST             0.408                 -
Leaderboard      0.439                 -
DE-BERT          0.391                 0.339
ME-BERT          0.395                 0.353
ME-HYBRID        0.394                 0.353

Table 1: Development set results on MS MARCO-Passage (MS-Passage) and MS MARCO-Document (MS-Doc), showing MRR@10.

Model            MRR(MS)   RR      NDCG@10   Holes@10

Passage Retrieval
BM25-Anserini    0.186     0.825   0.506     0.000
DE-BERT          0.295     0.936   0.639     0.165
ME-BERT          0.323     0.968   0.687     0.109
DE-HYBRID-E      0.306     0.951   0.659     0.105
ME-HYBRID-E      0.336     0.977   0.706     0.051

Document Retrieval
Base-Indri       0.192     0.785   0.517     0.002
DE-BERT          -         0.841   0.510     0.188
ME-BERT          -         0.877   0.588     0.109
DE-HYBRID-E      0.287     0.890   0.595     0.084
ME-HYBRID-E      0.310     0.914   0.610     0.063

Table 2: Test set first-pass retrieval results on the passage and document TREC 2019 DL evaluation, as well as MS MARCO eval MRR@10 (passage) and MRR@100 (document) under MRR(MS).

Results. Table 1 comparatively evaluates our
models on the dev sets of two tasks. The state of
the art prior work follows the two-stage retrieval
and reranking approach, where an efficient first-
stage system retrieves a (usually large) list of
candidates from the document collection, E
a second stage more expensive model such as
cross-attention BERT reranks the candidates.

Our focus is on improving the first stage, and we
compare to prior works in two settings: Retrieval,
top part of Table 1, where only first-stage efficient
retrieval systems are used and Reranking, bottom

part of the table, where more expensive second-
stage models are employed to re-rank candidates.
Figure 6 delves into the impact of the first-stage
retrieval systems as the number of candidates the
second stage reranker has access to is substantially
reduced, improving efficiency.

We report results in comparison to the following systems: 1) MULTI-STAGE (Nogueira and Lin, 2019), which reranks BM25 candidates with a cascade of BERT models, 2) DOC2QUERY (Nogueira et al., 2019b) and DOCT5QUERY (Nogueira and Lin, 2019), which use neural models to expand documents before indexing and scoring with sparse retrieval models, 3) DEEPCT (Dai and Callan, 2020b), which learns to map BERT's contextualized text representations to context-aware term weights, 4) HDCT (Dai and Callan, 2020a), which uses a hierarchical approach that combines passage-level term weights into document-level term weights, 5) IDST, a two-stage cascade ranking pipeline by Yan et al. (2020), and 6) Leaderboard, which is the best score on the MS MARCO-passage leaderboard as of Sept. 18, 2020.⁷

We also compare our models both to our own BM25 implementation described in §4.1, and to the external publicly available sparse model implementations, denoted with BM25-E. For the passage task, BM25-E is the Anserini (Yang et al., 2018a) system with default parameters. For the document task, BM25-E is the official IndriQueryLikelihood baseline. We report on dense-sparse hybrids using both our own BM25 and the external sparse systems; the latter hybrids are indicated by a suffix -E.

Looking at the top part of Table 1, we can
see that our DE-BERT model already outperforms
or is competitive with prior systems. The multi-
vector model brings larger improvement on the
dataset containing longer documents (MS MARCO
document), and the sparse-dense hybrid models
bring improvements over dense-only models on
both datasets. According to a Wilcoxon signed
rank test for statistical significance, all differences
between DE-BERT, ME-BERT, DE-HYBRID-E, and ME-
HYBRID-E are statistically significant on both
development sets with p-value < .0001. When a large number of candidates can be reranked, the impact of the first-stage system decreases. In the bottom part of the table we 7https://microsoft.github.io/msmarco/. 339 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 6 9 1 9 2 4 0 4 0 / / t l a c _ a _ 0 0 3 6 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 6: MRR@10 when reranking at different retrieval depth (10 to 1000 candidates) for MS MARCO. see that our models are comparable to sys- tems reranking BM25 candidates. The accuracy of the first-stage system is particularly impor- tant when the cost of reranking a large set of candidates is prohibitive. Figure 6 shows the performance of systems that rerank a smaller number of candidates. We see that, when a very small number of candidates can be scored with expensive cross-attention models, the multi- vector ME-BERT and hybrid models achieve large improvements compared to prior systems on both MS MARCO tasks. Table 2 shows test results for dense models, external sparse model baselines, and hybrids of the two (without reranking). In addition to test set (eval) results on the MS MARCO passage task, we report metrics on the manually annotated passage and document retrieval test set at TREC DL 2019. We report the fraction of unrated items as Holes@10 following Xiong et al. (2020). Time and Space Analysis Figure 7 compares the running time/quality trade-off curves for DE- BERT and ME-BERT on the MS MARCO passage task using the ScaNN (Guo et al., 2020) library on a 160 Intel(R) Xeon(R) CPU @ 2.20GHz cores machine with 1.88TB memory. Both models use one vector of size k = 1024 per query; DE-BERT uses one and ME-BERT uses 3 vectors of size k = 1024 per document. The size of the document index for DE-BERT is 34.2GB and the size of the index for ME-BERT is about 3 times larger. The indexing time was 1.52h and 3.02h for DE-BERT and ME-BERT, respectively. The ScaNN configuration we use is num leaves=5000, and num leaves to search ranges from 25 to 2000 (from less to more exact search) and time per query is measured when using parallel inference on all 160 cores. In the higher quality range of the curves, ME-BERT achieves 340 Figure 7: Quality/running time tradeoff for DE-BERT and ME-BERT on the MS MARCO passage dev set. Dashed lines show quality with exact search. substantially higher MRR than DE-BERT for the same inference time per query. 8 Related Work We have mentioned research on improving the accuracy of retrieval models throughout the paper. Here we focus on work related to our central focus on the capacity of dense dual encoder representations relative to sparse bags-of-words. In compressive sensing it is possible to recover a bag of words vector x from the projection Ax for suitable A. Bounds for the sufficient dimensionality of isotropic Gaussian projections (Candes and Tao, 2005; Arora et al., 2018) are more pessimistic than the bound described in this is unsurprising because the task 2, but § of recovering bags-of-words from a compressed measurement is strictly harder than recovering inner products. Subramani et al. (2019) ask whether it is possi- ble to exactly recover sentences (token sequences) from pretrained decoders, using vector embed- dings that are added as a bias to the decoder hidden is more state. 
Because their decoding model expressive (and thus more computationally inten- sive) than inner product retrieval, the theoretical l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 6 9 1 9 2 4 0 4 0 / / t l a c _ a _ 0 0 3 6 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 issues examined here do not apply. Nonetheless, (Subramani et al., 2019) empirically observe a similar dependence between sentence length and embedding size. Wieting and Kiela (2019) rep- resent sentences as bags of random projections, finding that high-dimensional projections (k = 4096) perform nearly as well as trained encoding models. These empirical results provide further empirical support for the hypothesis that bag-of- words vectors from real text are ‘‘hard to embed’’ in the sense of Larsen and Nelson (2017). Our contribution is to systematically explore the rela- tionship between document length and encoding dimension, focusing on the case of exact inner product-based retrieval. We leave the combina- tion of representation learning and approximate retrieval for future work. 9 Conclusion Transformers perform well on an unreasonable range of problems in natural language processing. Yet the computational demands of large-scale retrieval push us to seek other architectures: cross-attention over contextualized embeddings is too slow, but dual encoding into fixed- length vectors may be insufficiently expressive, sometimes failing even to match the performance of sparse bag-of-words competitors. We have used both theoretical and empirical techniques to characterize the fidelity of fixed-length dual encoders, focusing on the role of document length. Based on these observations, we propose hybrid models that yield strong performance while maintaining scalability. Acknowledgments We thank Ming-Wei Chang, Jon Clark, William Cohen, Kelvin Guu, Sanjiv Kumar, Kenton Lee, Jimmy Lin, Ankur Parikh, Ice Pasupat, Iulia Turc, William A. Woods, Vincent Zhao, and the anonymous reviewers for helpful discussions of this work. References Dimitris Achlioptas. 2003. Database-friendly ran- dom projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687. DOI: https:// doi.org/10.1016/S0022-0000(03) 00025-4 Noga Alon and Bo’az Klartag. 2017. Optimal compression of approximate inner products In 58th Annual and dimension reduction. Symposium on Foundations of Computer Sci- ence (FOCS). DOI: https://doi.org/10 .1109/FOCS.2017.65 Alexandr Andoni, Piotr Indyk, and Ilya Razen- shteyn. 2019. Approximate nearest neigh- bor search in high dimensions. Proceedings of the International Congress of Mathematicians (ICM 2018). Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. 2018. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In Proceedings of the International Conference on Learning Representations (ICLR). DOI: https://doi .org/10.1142/9789813272880 0182 Shai Ben-David, Nadav Eiron, and Hans Ulrich learning via Simon. 2002. Limitations of embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3(Nov):441–461. Emmanuel J. Candes and Terence Tao. 2005. IEEE Decoding by linear programming. Transactions on Information Theory, 51(12): 4203–4215. DOI: https://doi.org/10 .1109/TIT.2005.858979 Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. 
In Text REtrieval Conference (TREC). TREC. Zhuyun Dai and Jamie Callan. 2020a. Context- aware document term weighting for ad-hoc search. In Proceedings of The Web Conference 2020. DOI: https://doi.org/10.1145 /3366423.3380258 Zhuyun Dai and Jamie Callan. 2020b. Context- aware sentence/passage term importance esti- mation for first stage retrieval. Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training 341 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 6 9 1 9 2 4 0 4 0 / / t l a c _ a _ 0 0 3 6 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 of deep bidirectional transformers for language the 2019 understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Laura Dietz, Ben Gamari, Jeff Dalton, and Nick Craswell. 2018. TREC complex answer retrieval overview. In Text REtrieval Conference (TREC). Luyu Gao, Zhuyun Dai, Zhen Fan, and Jamie Callan. 2020. Complementing lexical retrieval with semantic residual embedding. CoRR, abs/2004.13969. Version 1. Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Compu- tational Natural Language Learning (CoNLL). DOI: https://doi.org/10.18653/v1 /K19-1049 Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016a. A deep relevance matching model for ad-hoc retrieval. In Pro- ceedings of the 25th ACM International on Conference on Information and Knowledge Management. Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016b. Quan- tization based fast inner product search. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the 37th International Conference on Machine Learning. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM). DOI: https://doi.org/10.1145 /2505515.2505665 342 Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly- encoders: Transformer architectures and pre- fast training strategies and accurate for In Proceedings of multi-sentence scoring. the International Conference on Learning Representations (ICLR). Thathachar S. transforms Jayram and David P. Woodruff. 2013. Optimal bounds for Johnson- Lindenstrauss streaming problems with subconstant error. ACM Trans- actions on Algorithms (TALG), 9(3):1–17. https://doi.org/10.1145 DOI: /2483699.2483706 and William B. Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206):1. Daniel Kane, Raghu Meka, and Jelani Nelson. 2011. Almost Johnson- optimal Lindenstrauss families. Approximation, Ran- domization, and Combinatorial Optimization. Algorithms and Techniques. 
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Cody Kwok, Oren Etzioni, and Daniel S. Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19(3):242–262.

Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics (ACL).

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan T. McDonald. 2020. Zero-shot neural retrieval via domain-targeted synthetic query generation. CoRR, abs/2004.14503.

Bhaskar Mitra and Nick Craswell. 2018. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1):1–126.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. CoRR, abs/1901.04085.

Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. https://github.com/castorini/docTTTTTquery

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019a. Multi-stage document ranking with BERT. CoRR, abs/1910.14424.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019b. Document expansion by query prediction. CoRR, abs/1904.08375.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond.
Foundations and Trends in Information Retrieval, 3(4):333–389.

Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.

Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. 2019. Can unconditional language models recover arbitrary sentences? In Advances in Neural Information Processing Systems.

Santosh S. Vempala. 2004. The Random Projection Method, volume 65. American Mathematical Society.

Ellen M. Voorhees. 2001. The TREC question answering track. Natural Language Engineering, 7(4):361–378.

John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In Proceedings of the International Conference on Learning Representations (ICLR).

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. CoRR, abs/2007.00808. Version 1.

Ming Yan, Chenliang Li, Chen Wu, Bin Bi, Wei Wang, Jiangnan Xia, and Luo Si. 2020. IDST at TREC 2019 deep learning track: Deep cascade ranking with generation-based document expansion and pre-trained language modeling. In Text REtrieval Conference (TREC).

Peilin Yang, Hui Fang, and Jimmy Lin. 2018a. Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4).

Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Simple applications of BERT for ad hoc document retrieval. CoRR, abs/1903.10972.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018b. Breaking the softmax bottleneck: A high-rank RNN language model. In Proceedings of the International Conference on Learning Representations (ICLR).

A Proofs

A.1 Lemma 1

Proof. For both distributions of embeddings, the error on the squared norm can be bounded with high probability (Achlioptas, 2003, Lemma 5.1):

$\Pr\left(\left|\|Ax\|^2 - \|x\|^2\right| > \epsilon\|x\|^2\right) < 2\exp\left(-\tfrac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right).$   (4)

This bound implies an analogous bound on the absolute error of the inner product (Ben-David et al., 2002, corollary 19),

$\Pr\left(\left|\langle x, y\rangle - \langle Ax, Ay\rangle\right| \geq \tfrac{\epsilon}{2}\left(\|x\|^2 + \|y\|^2\right)\right) \leq 4\exp\left(-\tfrac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right).$   (5)

Let $\bar{q} = q/\|q\|$ and $\bar{d} = (d_1 - d_2)/\|d_1 - d_2\|$. Then $\mu(q, d_1, d_2) = \langle \bar{q}, \bar{d}\rangle$. A ranking error occurs if and only if $\langle A\bar{q}, A\bar{d}\rangle \leq 0$, which implies $|\langle A\bar{q}, A\bar{d}\rangle - \langle \bar{q}, \bar{d}\rangle| \geq \epsilon$. By construction $\|\bar{q}\| = \|\bar{d}\| = 1$, so the probability of an inner product distortion $\geq \epsilon$ is bounded by the right-hand side of (5).
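As a rough numerical illustration of the bound in (5) (not part of the original analysis; the dimension, projection size, tolerance, and trial count below are arbitrary choices), the following sketch draws Gaussian random projections and measures how often the inner product of two unit vectors is distorted by more than epsilon:

import numpy as np

# Illustrative sketch: empirical frequency of inner-product distortion >= eps
# under a Gaussian random projection, compared with the bound in (5).
# All constants here are assumptions made only for this illustration.
rng = np.random.default_rng(0)
n, k, eps, trials = 1000, 128, 0.25, 2000

failures = 0
for _ in range(trials):
    q_bar = rng.normal(size=n)
    d_bar = rng.normal(size=n)
    q_bar /= np.linalg.norm(q_bar)   # unit vectors, as in the argument above
    d_bar /= np.linalg.norm(d_bar)
    # Entries drawn from N(0, 1/k), so that E[<Ax, Ay>] = <x, y>.
    A = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    distortion = abs(A.dot(q_bar).dot(A.dot(d_bar)) - q_bar.dot(d_bar))
    failures += distortion >= eps

bound = 4 * np.exp(-(k / 2) * (eps**2 / 2 - eps**3 / 3))
print(f"empirical rate: {failures / trials:.4f}   bound (5): {bound:.4f}")

In a typical run the empirical rate falls well below the bound, which is expected: (5) is a worst-case guarantee rather than a tight estimate.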

A.2 Corollary 1

Proof. We have $\epsilon = \mu(q, d_1, d_2) = \langle \bar{q}, \bar{d}\rangle \leq 1$ by the Cauchy-Schwarz inequality. For $\epsilon \leq 1$, we have $\epsilon^2/6 \leq \epsilon^2/2 - \epsilon^3/3$. We can then loosen the bound in (1) to $\beta \geq 4\exp\left(-\epsilon^2 k/12\right)$. Taking the natural log yields $\ln\beta \geq \ln 4 - \epsilon^2 k/12$, which can be rearranged into $k \geq 12\,\epsilon^{-2}\ln\tfrac{4}{\beta}$.
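As a concrete illustration (the values of epsilon and beta are hypothetical, chosen only to give a sense of scale): to preserve a normalized margin of $\epsilon = 0.05$ with failure probability $\beta = 0.01$, the corollary requires

$k \geq 12\,\epsilon^{-2}\ln\tfrac{4}{\beta} = 12 \cdot 400 \cdot \ln 400 \approx 4800 \cdot 5.99 \approx 2.9 \times 10^{4}.$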


A.3 Lemma 2

Proof. For convenience define $\mu(d_2) = \mu(q, d_1, d_2)$. Define $\epsilon$ as in the theorem statement, and define $D_\epsilon = \{d_2 \in D : \mu(q, d_1, d_2) \geq \epsilon\}$. We have

$\Pr(R \geq r_0) \leq \Pr\left(\exists\, d_2 \in D_\epsilon : \langle Aq, Ad_1\rangle \leq \langle Aq, Ad_2\rangle\right)$
$\leq \sum_{d_2 \in D_\epsilon} 4\exp\left(-\tfrac{k}{2}\left(\mu(d_2)^2/2 - \mu(d_2)^3/3\right)\right)$
$\leq 4\,|D_\epsilon|\exp\left(-\tfrac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right).$

The first inequality follows because the event $R \geq r_0$ implies the event $\exists\, d_2 \in D_\epsilon : \langle Aq, Ad_1\rangle \leq \langle Aq, Ad_2\rangle$. The second inequality follows by a combination of Lemma 1 and the union bound. The final inequality follows because for any $d_2 \in D_\epsilon$, $\mu(q, d_1, d_2) \geq \epsilon$. The theorem follows because $|D_\epsilon| = |D| - r_0 + 1$.
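For a rough sense of scale (the values here are illustrative only and are not results from the paper): with $|D_\epsilon| = 10^{6}$, $\epsilon = 0.1$, and $k = 10^{4}$, the bound gives

$\Pr(R \geq r_0) \leq 4 \cdot 10^{6}\exp\left(-5000 \cdot (0.005 - 0.00033)\right) \approx 4 \cdot 10^{6} \cdot e^{-23.3} \approx 3 \times 10^{-4}.$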

A.4 Corollary 2

Proof. For the retrieval function $\max_d \langle q, d\rangle$, the minimum non-zero unnormalized margin $\langle q, d_1\rangle - \langle q, d_2\rangle$ is 1 when $q$ and $d$ are Boolean vectors. Therefore the normalized margin has lower bound $\mu(q, d_1, d_2) \geq 1/(\|q\| \times \|d_1 - d_2\|)$. For non-negative $d_1$ and $d_2$ we have $\|d_1 - d_2\| \leq \sqrt{\|d_1\|^2 + \|d_2\|^2} \leq \sqrt{2 L_D}$, and $\|q\| \leq \sqrt{L_Q}$. Preserving a normalized margin of $\epsilon = (2 L_Q L_D)^{-1/2}$ is therefore sufficient to avoid any pairwise errors. By plugging this value into Corollary 1, we see that setting $k \geq 24\, L_Q L_D \ln\tfrac{4}{\beta}$ ensures that the probability of any pairwise error is $\leq \beta$.
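As an illustration (the query and document lengths are hypothetical, not the paper's experimental settings): for Boolean queries with $L_Q = 16$, documents with $L_D = 256$, and $\beta = 0.01$, the guarantee requires

$k \geq 24\, L_Q L_D \ln\tfrac{4}{\beta} = 24 \cdot 16 \cdot 256 \cdot \ln 400 \approx 98304 \cdot 5.99 \approx 5.9 \times 10^{5},$

which makes concrete how the encoding dimension required for exact Boolean retrieval grows linearly with document length.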

ICT task (MRR@10)

Model                 Reranking (passage length)      Retrieval (passage length)
                      50     100    200    400        50     100    200    400
CROSS-ATTENTION       99.9   99.9   99.8   99.6       --     --     --     --
HYBRID-ME-BERT-uni    --     --     --     --         98.2   97.0   94.4   91.9
HYBRID-ME-BERT-bi     --     --     --     --         99.3   99.0   97.3   96.1
ME-BERT-768           98.0   96.7   92.4   89.8       96.8   96.1   91.1   85.2
ME-BERT-64            96.3   94.2   89.0   83.7       92.9   91.7   84.6   72.8
DE-BERT-768           91.7   87.8   79.7   74.1       90.2   85.6   72.9   63.0
DE-BERT-512           91.4   87.2   78.9   73.1       89.4   81.5   66.8   55.8
DE-BERT-128           90.5   85.0   75.0   68.1       85.7   75.4   58.0   47.3
DE-BERT-64            88.8   82.0   70.7   63.8       82.8   68.9   48.5   38.3
DE-BERT-32            83.6   74.9   62.6   55.9       70.1   53.2   34.0   27.6
BM25-uni              92.1   88.6   84.6   81.8       92.1   88.6   84.6   81.8
BM25-bi               98.0   97.1   95.9   94.5       98.0   97.1   95.9   94.5

NQ (Recall@400 tokens)

Model                 Reranking (passage length)      Retrieval (passage length)
                      50     100    200    400        50     100    200    400
CROSS-ATTENTION       48.9   55.5   54.2   47.6       --     --     --     --
HYBRID-ME-BERT-uni    --     --     --     --         45.7   49.5   48.5   42.9
ME-BERT-768           43.6   49.6   46.5   38.7       42.0   43.3   40.4   34.4
ME-BERT-64            44.4   48.7   44.5   38.2       42.2   43.4   38.9   33.0
DE-BERT-768           42.9   47.7   44.4   36.6       44.2   44.0   40.1   32.2
DE-BERT-512           43.8   48.5   44.1   36.5       43.3   43.2   38.8   32.7
DE-BERT-128           42.8   45.7   41.2   35.7       38.0   36.7   32.8   27.0
DE-BERT-64            42.6   45.7   42.5   35.4       37.4   35.1   32.6   26.6
DE-BERT-32            42.4   45.8   42.1   34.0       36.3   34.7   31.0   24.9
BM25-uni              30.1   35.7   34.1   30.1       30.1   35.7   34.1   30.1

Table A.1: Results on the ICT task and the NQ task (corresponding to Figure 4 and Figure 5). Columns report scores at passage lengths of 50, 100, 200, and 400 tokens in the reranking and retrieval settings; -- marks settings for which no number is reported.

A.5 Theorem 1

Proof. Recall that $\mu(q, d_1, d_2) = \dfrac{\langle q, d_1 - d_2\rangle}{\|q\| \times \|d_1 - d_2\|}$. By assumption we have $\langle q, d_1\rangle = \langle q, d_1^{(i)}\rangle$ and $\max_j \langle q, d_2^{(j)}\rangle = \langle q, d_2^{(i)}\rangle \leq \langle q, d_2\rangle$, implying that

$\langle q, d_1^{(i)} - d_2^{(i)}\rangle \geq \langle q, d_1 - d_2\rangle.$   (6)

In the denominator, we expand $\|d_1 - d_2\|$, where $d = d^{(i)} + d^{(\neg i)}$ and $d^{(\neg i)} = \sum_{j \neq i} d^{(j)}$ for each document. Plugging this into the squared norm,

$\|d_1 - d_2\|^2 = \left\|(d_1^{(i)} - d_2^{(i)}) + (d_1^{(\neg i)} - d_2^{(\neg i)})\right\|^2$   (7)
$= \|d_1^{(i)} - d_2^{(i)}\|^2 + \|d_1^{(\neg i)} - d_2^{(\neg i)}\|^2 + 2\langle d_1^{(i)} - d_2^{(i)},\; d_1^{(\neg i)} - d_2^{(\neg i)}\rangle$   (8)
$\geq \|d_1^{(i)} - d_2^{(i)}\|^2 + 2\langle d_1^{(i)} - d_2^{(i)},\; d_1^{(\neg i)} - d_2^{(\neg i)}\rangle$   (9)
$= \|d_1^{(i)} - d_2^{(i)}\|^2.$   (10)

The inner product $\langle d_1^{(i)} - d_2^{(i)},\; d_1^{(\neg i)} - d_2^{(\neg i)}\rangle = 0$ because the segments are orthogonal. The combination of (6) and (10) completes the theorem.
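The following toy computation (our own illustration; the vocabulary, the vectors, and the segment boundary are invented for this sketch) checks the direction of the inequalities above numerically: with orthogonal segments, the segment-level normalized margin is at least the document-level one.

import numpy as np

# Toy check of the margin comparison used in the proof of Theorem 1.
# Vocabulary of 8 terms; segment i covers terms 0-3 and its complement covers
# terms 4-7, so the two segments of each document are orthogonal by construction.
q  = np.array([1., 1., 0., 0., 0., 0., 0., 0.])   # query matches only segment-i terms
d1 = np.array([1., 1., 1., 0., 0., 1., 1., 1.])   # relevant document
d2 = np.array([1., 0., 0., 1., 1., 1., 0., 1.])   # competing document

mask_i = np.zeros(8)
mask_i[:4] = 1.0
d1_i, d2_i = d1 * mask_i, d2 * mask_i             # restrictions to segment i

def normalized_margin(q, a, b):
    # mu(q, a, b) = <q, a - b> / (||q|| * ||a - b||)
    return q.dot(a - b) / (np.linalg.norm(q) * np.linalg.norm(a - b))

mu_doc = normalized_margin(q, d1, d2)
mu_seg = normalized_margin(q, d1_i, d2_i)
print(f"document-level margin: {mu_doc:.3f}   segment-level margin: {mu_seg:.3f}")
assert mu_seg >= mu_doc   # consistent with combining (6) and (10)

Here the document-level margin is about 0.32 and the segment-level margin about 0.41, matching the direction of the inequality obtained by combining (6) and (10).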