Time-Aware Language Models as Temporal Knowledge Bases

Bhuwan Dhingra∗∗∗   Jeremy R. Cole∗   Julian Martin Eisenschlos   Daniel Gillick   Jacob Eisenstein   William W. Cohen

Google Research

{bdhingra,jrcole,eisenjulian,dgillick,jeisenstein,wcohen}@google.com

Abstract

Many facts come with an expiration date, from the name of the President to the basketball team Lebron James plays for. However, most language models (LMs) are trained on snapshots of data collected at a specific moment in time. This can limit their utility, particularly in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time and highlight problems with LMs at either end of the spectrum—those trained on specific slices of temporal data, as well as those trained on a wide range of temporal data. To mitigate these problems, we propose a simple technique for jointly modeling text with its timestamp. This improves memorization of seen facts from the training time period, as well as calibration on predictions about unseen facts from future time periods. We also show that models trained with temporal context can be efficiently ''refreshed'' as new data arrives, without the need for retraining from scratch.

1 Introduction

Language models (LMs) have been suggested
as repositories of real-world knowledge (Petroni
et al., 2019) and there is much interest in using
them for tasks such as closed-book question an-
swering (QA; Roberts et al., 2020), fact verifica-
tion (Lee et al., 2020), and dialogue (Adiwardana
et al., 2020). Many facts, however, change with
time. This raises two questions: Do pretrained
LMs learn the appropriate temporal scope for
the facts they encode? And what is the best way
to update temporally scoped knowledge in pre-
trained models?

∗Equal contribution.
∗∗Also affiliated with Duke University, work done at Google.


Pretraining corpora for models such as BERT
(Devlin et al., 2019), RoBERTa (Liu et al., 2019),
and GPT (Radford et al., 2019) are typically de-
rived from a snapshot of the web crawled at a
specific moment in time (Raffel et al., 2020).
While the impact on language modeling itself has
been highlighted in recent work (e.g., Lazaridou
et al., 2021; Röttger and Pierrehumbert, 2021;
Hombaiah et al., 2021), there are several poten-
tial problems specific to the encoding of factual
knowledge:

• Averaging: For temporally scoped knowl-
edge, the model may see conflicting informa-
tion, for example, ''Lebron James plays for
the Cavaliers / Lakers.'' Because LM train-
ing generally ignores temporal metadata, this
can lead to an averaging effect, in which
the model has low confidence in any of the
correct answers.

• Forgetting: Corpora such as Wikipedia and
web crawls are constantly growing, with
documents distributed non-uniformly across
time: There are more recent documents than
older ones, both because old documents can
be updated and because more web documents
are generated recently than in the past. As a
result, the model may fail to memorize facts
that were valid only during underrepresented
periods of time, and therefore do worse when
asked questions about the more distant past.

• Poor temporal calibration: As language
models become ‘‘stale’’, they are increas-
ingly likely to be queried about facts outside
the temporal scope of their training data.
While it may seem undesirable for a model to
guess the answer to such questions, in many
cases it is perfectly reasonable to assume that

Transactions of the Association for Computational Linguistics, vol. 10, pp. 257–273, 2022. https://doi.org/10.1162/tacl_a_00459
Action Editor: Anna Korhonen. Submission batch: 10/2021; Revision batch: 11/2021; Published 3/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


the future will be like the present: for exam-
ple, in twenty years the capital of Alaska is
unlikely to change, even though the gover-
nor of Alaska is nearly impossible to pre-
dict. Ideally, the confidence with which the
model responds to such queries should reflect
this difficulty.

Temporally scoped facts are common in prac-
tice; however, QA datasets such as SQuAD
(Rajpurkar et al., 2018) or Natural Questions
(Kwiatkowski et al., 2019) focus on a single time
period, even for questions whose answers are tem-
porally scoped. Thus, our first contribution in this
paper is a diagnostic dataset, TEMPLAMA (short
for TEMPoral LAnguage Model Analysis), of
fill-in-the-blank queries for probing time-sensitive
knowledge in LMs. The queries in TEMPLAMA
are chosen such that the answer varies with time
(§ 2.1). Using this dataset, we find empirical evi-
dence of the problems mentioned above (§ 3).

As a first step towards addressing these prob-
lems, we propose a lightweight modification to
pretraining. We parametrize the masked language
modeling objective (MLM; Devlin et al., 2019)
with temporal information, P(y|x, t; θ), where y
is a masked token or span, x is the textual context,
and t is the time (§ 2.3). The parameters θ must
learn a representation of both text and time. In the
T5 framework (Raffel et al., 2020), this can be ac-
complished by prefixing the input x with a string
representation of t, for example, ''year: 2018''.
In addition, we pretrain from documents that
are uniformly sampled from the timespan of the
training corpus which, in our case, consists of
news articles ranging from 2010–2018 (Lazaridou
et al., 2021) (§ 2.1). These interventions accom-
plish two goals: the model is exposed to facts
from the entire time range instead of just the
most recent one, which avoids forgetting certain
temporally scoped facts, and it prevents averag-
ing because the facts are assigned to different time
buckets (in our case years). This leads to improved
recall of facts from the timespan of the training
corpus (§ 3.1).

These interventions also improve the model’s
temporal calibration. We find that jointly model-
ing text and time improves perplexity on future
years unseen during training. On TEMPLAMA,
the joint model degrades more gracefully than
a model unaware of time. We also examine the
model’s calibration farther into the future using

hand-crafted sets of queries whose answer is likely
to change frequently, rarely, or never. We find
qualitative evidence that the entropy of models
trained uniformly across the training timespan in-
creases most rapidly for the frequently changing
facts (§ 3.2).

While calibration is desirable, models should
be refreshed with new data when it becomes
available. A standard practice for doing this is
to combine the new and old data and retrain the
model from scratch (e.g., Liu et al., 2021), but
retraining can be costly for large-scale models
(Strubell et al., 2019). On the other hand, finetun-
ing only on the new data leads to catastrophic
forgetting of the old data (Zhu et al., 2020),
since standard LMs have no knowledge of what is
‘‘new’’ and what is ‘‘old’’, unlike a model trained
with temporal context. We show that our tempo-
rally scoped pretraining procedure makes LMs
more amenable to post-hoc finetuning, as the
data is implicitly bucketed into non-overlapping
time slices. We observe a similar performance to
models retrained from scratch with 30× fewer
steps, and without degradation on the knowledge
encoded by the older data (§ 3.3).

Summary of Contributions:
(1) We offer
TEMPLAMA, a new dataset of temporally scoped
knowledge probes. (2) We propose a simple mod-
ification to pretraining that facilitates the acqui-
sition of temporal knowledge. (3) We conduct
evaluations that demonstrate the impact of tem-
poral shift on the knowledge encoded by existing
LMs and the improvements offered by temporally
scoped pretraining. (4) We perform a qualitative
analysis of temporal calibration into the future,
again demonstrating the positive impact of tem-
porally scoped pretraining. (5) We show that tem-
porally scoped pretraining also facilitates efficient
updates to existing pretrained LMs.

2 Methods

We probe factual knowledge in masked LMs using
span prediction—given an input statement x with
a span y replaced by a special character, the
task is to reconstruct that span. In addition, we
assume that each (x, y) pair has a timestamp t
denoting the time at which it was written or a
point in time at which its assertion is valid. In
this paper, we discretize t into yearly buckets
and leave more fine-grained groupings (e.g., at


the level of months or days) for future work. For
simplicity and efficiency, all of our models are
text-to-text Transformers (Vaswani et al., 2017)
initialized from publicly available T5 checkpoints
(Raffel et al., 2020) and then adapted to more
time-dependent datasets. We first describe these
datasets, followed by the approaches for jointly
modeling text and time.

2.1 Datasets

We experiment with a large-scale news corpus
(CUSTOMNEWS) for pretraining our models, com-
bined with a smaller diagnostic dataset of factual
queries (TEMPLAMA) for evaluation.

CUSTOMNEWS The CUSTOMNEWS dataset is a sub-
set of web documents that are determined to be
news (Lazaridou et al., 2021) and have an as-
sociated date either extracted from the article’s
URL or from its html by looking for a publication
date. We adapt this dataset in two main ways.
First, we focus on a subset created by randomly
sampling 1M news articles from each of the years
2010–2020, which had the maximum number of ar-
ticles. Second, while Lazaridou et al. (2021) used
this data for classic autoregressive language mod-
eling, we instead adapt it for the MLM objective.
Specifically, we split the articles into sentences x
and then identify salient spans y in the text corre-
sponding to named entities and dates. The salient
span masking (SSM) paradigm improves question
answering performance in both open-book (Guu
et al., 2020) and closed-book settings (Roberts
et al., 2020). SSM restricts the inputs to those
which have a higher chance of requiring world
knowledge and better aligns with our objective
of measuring the factual knowledge captured by
the LMs. Following Guu et al. (2020), we iden-
tify named entities using a BERT-based tagger
trained on CoNLL-2003 data (Tjong Kim Sang
and De Meulder, 2003) and a regular expression
for dates.
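
To make this preprocessing concrete, the sketch below (an illustration, not the released pipeline) shows how a single sentence could be turned into a salient-span-masked example; the NER spans are assumed to come from an external tagger, and the date regex, function name, and T5 sentinel formatting are assumptions.

```python
import random
import re

# Hypothetical date pattern; the paper only states that dates are found with a regular expression.
DATE_RE = re.compile(r"\b(19|20)\d{2}\b")

def salient_span_example(sentence, entity_spans, year):
    """Build one masked-LM example by hiding a salient named-entity or date span.

    entity_spans: (start, end) character offsets produced by an external NER
    tagger (the paper uses a BERT tagger trained on CoNLL-2003).
    """
    spans = list(entity_spans) + [m.span() for m in DATE_RE.finditer(sentence)]
    if not spans:
        return None  # nothing salient to mask; skip this sentence
    start, end = random.choice(spans)
    # T5-style span corruption: a sentinel replaces the span in the input,
    # and the target reproduces the sentinel followed by the hidden span.
    return {
        "input": sentence[:start] + "<extra_id_0>" + sentence[end:],
        "target": "<extra_id_0> " + sentence[start:end],
        "year": year,  # kept so the Temporal variant can later prefix "year: <t>"
    }

example = salient_span_example(
    "Lebron James joined the Lakers in 2018.", [(0, 12), (24, 30)], 2018)
```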

TEMPLAMA We also construct a more targeted
masked LM evaluation for probing temporally
sensitive knowledge. Starting with the November
2020 Wikidata snapshot (Vrandečić and Krötzsch,
2014), we first identify all facts that have either
a start or an end date after 2010 and whose sub-
jects and objects are both entities with Wikipedia
pages.1 Among these 482K facts, we identify sub-

1We use SLING (Ringgaard et al., 2017) for preprocessing.

Year   Input                                                                          Target

CUSTOMNEWS
2017   The pound faces pressure from the US but the X election could hit euro        French
2020   X accused Liverpool of 'crossing the line' during win over his Chelsea side.  Frank Lampard

TEMPLAMA
2012   Cristiano Ronaldo plays for X .                                                Real Madrid
2019   Cristiano Ronaldo plays for X .                                                Juventus FC

Table 1: Examples from CUSTOMNEWS, which masks named entities and dates from news articles, and TEMPLAMA, a novel synthetic dataset of temporally scoped factual statements built from Wikidata.

ject and relation pairs that have multiple objects at
different times and select nine relations with the
most such subjects. For these relations we manu-
ally write template cloze queries (e.g., ''Subject
works for X .’’) and populate them with the
1000 most frequent subjects per relation. For each
subject and each relation we gather all the objects
with their associated time interval and construct
a separate query for each year in that interval.
When intervals for the object entities overlap, we
add all of them to the list of correct answers. The
query and the corresponding year form the inputs
x and t, while the object entity is the target y.
In total we construct 50,310 queries across 11
years.2 Note that these types of cloze-style ques-
tions naturally follow the salient span masking
paradigm, where the answer to the question is the
span to be masked. Table 1 shows examples from
both CUSTOMNEWS and TEMPLAMA. A full list
of the relations in TEMPLAMA and their template
queries is included in Appendix A.
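
The query-generation step described above can be sketched roughly as follows, assuming each Wikidata fact has already been reduced to a (subject, relation, object, first year, last year) tuple; the template store, function names, and the ''_X_'' mask rendering are illustrative, not the released code.

```python
from collections import defaultdict

# Hypothetical template store keyed by Wikidata property ID (see Table 8 for the full set).
TEMPLATES = {"P54": "{subject} plays for _X_."}

def templama_queries(facts, start=2010, end=2020):
    """facts: iterable of (subject, relation, obj, first_year, last_year) tuples."""
    answers = defaultdict(set)  # (query text, year) -> all objects valid in that year
    for subject, relation, obj, first, last in facts:
        template = TEMPLATES.get(relation)
        if template is None:
            continue
        query = template.format(subject=subject)
        for year in range(max(first, start), min(last, end) + 1):
            answers[(query, year)].add(obj)  # overlapping intervals yield multiple answers
    return [{"query": q, "year": t, "answers": sorted(objs)}
            for (q, t), objs in sorted(answers.items(), key=lambda kv: kv[0][1])]

examples = templama_queries([
    ("Cristiano Ronaldo", "P54", "Real Madrid", 2009, 2018),
    ("Cristiano Ronaldo", "P54", "Juventus FC", 2018, 2021),
])
```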

2.2 Training and Evaluation

We train and evaluate each of our models on a
mixture of CUSTOMNEWS and TEMPLAMA. All
models are initialized from a public T5 check-
point, and then further adapted for 300K steps on
our data. From CUSTOMNEWS we hold out 2000
articles each for validation and testing from each
of the yearly subsets. From TEMPLAMA we re-
serve 10% and 70% of the queries from each
of the yearly subsets for validation and testing,
respectively, ensuring that none of the subject en-
tities overlap between train, validation, or test sets.

2The TEMPLAMA data is available at https://github.com/google-research/language/tree/master/language/templama.


Figure 1: Three training setups to train T5 on CUSTOMNEWS: The Uniform model (left) is trained on all the data
without explicit time information. The Yearly model (middle) avoids averaging over similar contexts by training
separate models depending on the year, while the Temporal model (right) prepends a time prefix to each example.

Splitting along subject entities ensures that none
of the facts required to answer the test queries are
seen during training on TEMPLAMA (Lewis et al.,
2021). Instead they must be learned in an unsu-
pervised manner either from the T5 pretraining or
when adapting to CUSTOMNEWS. We train over the
combination of the two training sets such that for
every 1000 inputs from CUSTOMNEWS, the model
sees 1 input from TEMPLAMA. Finetuning on a
small disjoint set of queries from TEMPLAMA
in this manner avoids issues due to suboptimal
prompts (Jiang et al., 2020b; Logan et al., 2021)
by allowing the model to learn the expected for-
mat of queries and answers (e.g., ''Liverpool
F.C.’’ vs ‘‘Liverpool’’).

We also partition the data into two groups based
on the year: 2010–18 and 2019–20. Models are
trained only on the former, but tested on both
to measure their performance for both seen and
future time periods. This split was informed by the
fact that the T5 checkpoints were pretrained on
web text extracted in April 2019. The main metric
for evaluation is a token-level F1 score between
the predicted and ground truth targets, computed
in the same way as for the SQuAD benchmark
(Rajpurkar et al., 2018). For TEMPLAMA queries
with multiple targets we take the max F1.
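
As a reference, a minimal sketch of the token-level F1 with a max over multiple gold targets is given below; it uses bare whitespace tokenization and omits the answer normalization of the official SQuAD script.

```python
from collections import Counter

def token_f1(prediction, target):
    """SQuAD-style token-overlap F1 between a predicted and a gold string."""
    pred_toks, gold_toks = prediction.split(), target.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def max_f1(prediction, targets):
    """For queries with several valid answers, score against the best-matching one."""
    return max(token_f1(prediction, t) for t in targets)
```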

2.3 Jointly Modeling Text and Time

Given a dataset of (x, y, t) triples we model
P(y|x, t; θ) using variants of the T5 model where,
given x as the input sequence, we maximize the
likelihood of the target sequence y. We compare
two approaches to condition the predictions on the
time t (also see Figure 1).

Yearly In the first approach we use the temporal
context by training separate models specialized
to different time buckets (in our case years), so
P(y|x, t; θ) = P(y|x; θt). Thus, we train an
ensemble of nine T5 models adapted to each
year between 2010 and 2018 for an additional
300K steps. When provided with a test input, this
approach routes it to the appropriate yearly expert
based on its timestamp. If the timestamp falls
outside 2010–18, we use the closest yearly expert
(e.g., 2018 for all test inputs ≥ 2018).
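
The routing rule amounts to clamping the timestamp to the training range; a sketch under the assumption that one finetuned checkpoint is kept per year:

```python
def route_to_expert(timestamp_year, experts, first=2010, last=2018):
    """experts: dict mapping a year in 2010-2018 to its finetuned model handle.

    Timestamps outside the training range fall back to the closest expert,
    e.g. the 2018 model serves every query from 2019 onwards.
    """
    return experts[min(max(timestamp_year, first), last)]
```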

Temporal Training a separate expert for each
time slice reduces the averaging across conflict-
ing contexts (§ 1), but keeping an ensemble of
large-scale LMs is undesirable in practice. More-
over, there are regularities in how often facts
change (e.g., the FIFA World Cup happens every
4 years, whereas NBA Championships happen ev-
ery year), which a model specialized to a single
time slice might not be able to learn. Hence we
also train a single T5 model on the entire dataset
from 2010–2018 for 300K steps. In this model,
the time t is concatenated to the input, that is,
P(y|x, t; θ) = P(y|t ⊕ x; θ), using a simple string
representation of t as a prefix for the input x, for
example, ''year: 2014''.

Baselines The T5 checkpoints released by
Raffel et al. (2020) are pretrained on long inputs
with multiple masks and cannot directly be tested
using our factual knowledge probes. Instead, we
establish a baseline on the datasets introduced
above using the pretrained models from Roberts
et al. (2020), which were trained using SSM on
Wikipedia for an additional 100K steps. This is
referred to as T5-CBQA (closed-book question
answering). We also experiment with additionally
finetuning this model on TEMPLAMA for 5K steps
(T5-CBQA-ft).

To isolate the effect of time-aware pretraining,
we also train a Uniform model, which trains on


Model        #Parameters   CustomNews                      TempLAMA
                           2010–18   2019–20   Overall     2010–18   2019–20   Overall

T5-CBQA      737M          20.2      19.8      20.1        5.4       4.3       5.2
T5-CBQA-ft   737M          15.2      15.7      15.3        17.8      15.3      17.3
Uniform      737M          30.6      27.8      30.1        28.1      19.8      26.6
Yearly       6.6B          33.4      26.7      32.2        28.5      21.8      27.3
Temporal     737M          32.1      29.5      31.6        29.6      22.2      28.2

Table 2: F1 scores of Large-sized model variants for salient span mask prediction on CUSTOMNEWS
and TEMPLAMA. T5-CBQA is the pretrained model from Roberts et al. (2020), and T5-CBQA-ft is
further finetuned on TEMPLAMA. The Yearly model is an ensemble of 9 models, each finetuned on
a yearly slice of the training data between 2010 and 2018. We use the 2018 model when testing on
2019–20. The Uniform and Temporal models are trained on the entire data from 2010–18, and the latter
has additional temporal context. The F1 scores are macro-averaged across the evaluation years. The
Temporal model performs better on TEMPLAMA, which is focused only on temporally scoped facts, as
well as on the unseen years for CUSTOMNEWS.

the same uniformly sampled data as Temporal for
the same number of steps, but without the time
provided as an input. During training, examples
are shuffled rather than presented in chronological
order. Note that there are many ways of sampling
training data across time, and the optimal choice
likely depends on the relative importance of mem-
orizing old versus recent facts. Here we assume
all time slices in the training data are equally
important and hence focus on uniform sampling.

Hyperparameters We primarily focus on the
Large-sized T5 models with 770M parameters,
but we also investigate the scaling with size by
comparing to the Small (110M) and XXL (11B)
versions. We use the same set of hyperparameters
as Raffel et al. (2020), with a batch size of 2048,
a fixed learning rate of 0.001, and a dropout rate
of 0.1. All our models are trained for a fixed
number of 300K steps, except when adapting to
new data (§ 3.3), and then evaluated on the test
set. We found the loss on held-out CUSTOMNEWS
was still improving at the end of 300K steps,
but the overall trends were stable; to limit the
experimentation time we did not explore longer
training runs.

3 Experiments

We design several experiments to highlight the
problems around temporally scoped knowledge in
LMs and to test whether they can be addressed by
joint models of text and time.

3.1 Memorizing Facts Across Time

To understand the interplay of memorization and
time, we examine the TEMPLAMA and CUSTOM-
NEWS performance on the 2010–18 slice. This
permits us to analyze the forgetting and averaging
effects discussed in § 1 by comparing models
trained on different slices of the data and with or
without the temporal context.

Results Table 2 shows performance on the
2010–18 test sets of CUSTOMNEWS and TEMP-
LAMA. T5-CBQA and T5-CBQA-ft fare signifi-
cantly worse on TEMPLAMA (17.8) than the more
standard Natural Questions benchmark (28.5; cf.
Roberts et al., 2020). In particular, we find that
training on the news domain leads to significant
improvements on the temporally scoped
knowledge required by TEMPLAMA (comparing
T5-CBQA-ft and Uniform). The two approaches
that condition the predictions on time, Yearly and
Temporal, improve over Uniform, which trains
on the same data but without temporal context.
The Yearly ensemble, however, has linearly more
parameters and requires linearly more compute
to train. For 2010–18, the Yearly model performs
better on CUSTOMNEWS, which is far more likely
to describe short-lived facts, but the Temporal
model is better on TEMPLAMA, where the facts
typically span multiple years. We further investi-
gate the relationship between fact durations and
model performance below.

We show empirical evidence of averaging and
forgetting effects in Figure 2, which plots the F1


Figure 2: F1 score of models trained on data from a specific year on CUSTOMNEWS (Left) and TEMPLAMA (Middle)
as the gap between test and train years varies. Negative gaps indicate that the model is tested on data from before
the slice on which it was trained. The F1-score is macro-averaged across all possible pairs of train/test years
between 2010 and 2018. For comparison we also show the F1 score of Uniform and Temporal models averaged
across 2010–18. Shaded area shows the 95% confidence interval around the macro-average. The performance
drop on both sides shows the forgetting effect. (Right) F1 scores on TEMPLAMA grouped by the number of years
for which the answer to a query persists. Shaded area shows the 95% confidence interval using bootstrap.

score of the year-specific models as we vary the
gap between test and train years. The performance
drops quickly on both sides, showing forgetting;
however, the decline is larger for future years.
The right plot compares F1-scores on TEMPLAMA
for queries grouped by the number of years for
which their answer is valid.3 This is computed
from the duration of their corresponding facts
in Wikidata. The uniformly trained model has
higher performance on queries whose answers
persist for a long time, but it does worse on quer-
ies whose answers persist for less than 5 years.
The opposite is true for the year-specific models,
which is intuitive due to the averaging effect of
training on data from long periods of time. Add-
ing temporal context strikes a trade-off between
these two extremes, leading to the overall higher
F1 in Table 2.

Qualitatively, examining the TEMPLAMA ques-
tions that the Temporal model answers correctly
while the Uniform model answers incorrectly sup-
ports our hypothesis that the Uniform model is
averaging over possible choices: It frequently an-
swers with an entity that was more salient during
our training period (see Table 5).

Scaling Table 3 shows the effect of increasing
model size on the overall F1 scores on CUSTOM-
NEWS and TEMPLAMA. In general, larger model
sizes lead to a bigger improvement when training
with temporal context.

Longer Time Span. Table 6 compares the
Large-sized Uniform and Temporal models when

           CustomNews              TempLAMA
Size       Uniform   Temporal      Uniform   Temporal

Small      21.1      21.9          20.7      20.5
Large      30.1      31.6          26.6      28.2
XXL        32.3      33.8          28.4      30.5

Table 3: Overall F1-score averaged from 2010–20 for Uniform and Temporal models for different model sizes. Larger models benefit more from the temporal context.

trained on a wider time period from 2004 to
2018.4 While the Temporal model still outper-
forms Uniform, the gap is smaller between the two
compared to when training on 2010–18. In gen-
eral increasing the time period entails memorizing
more facts for the Temporal model. Thus, this
result suggests that the model size should also be
increased when training on longer time spans.

CronQuestions To explore whether the im-
proved memorization of facts translates to
downstream tasks, we finetune the Uniform and
Temporal models on CronQuestions, a dataset of
410K time-dependent questions based on tempo-
ral knowledge graphs (Saxena et al., 2021). It
consists of questions where the answer is either
an entity or a temporal expression. Similar to
TEMPLAMA, the questions are based on Wikidata
across time. We focus on a closed-book version
of the task, similar to the setup in Roberts et al.
(2020), where the model is trained to predict the

3For multiple answers we pick the duration of the first one.
4CUSTOMNEWS only has a small number of articles from 2003 and before.


Size    Model      EM     F1

Small   None       3.63    9.51
        Uniform    4.01   10.27
        Temporal   4.05   10.20

Large   None       4.10   10.78
        2018       4.39   10.87
        Uniform    4.70   11.34
        Temporal   5.13   11.93

XXL     None       5.44   12.19
        Uniform    5.71   12.61
        Temporal   5.81   12.88

Table 4: Test set results for models finetuned on the CronQuestions dataset in a closed-book manner. ''None'' refers to finetuning the T5 baseline; the ''2018'' model is adapted to the 2018 slice of CUSTOMNEWS.

first answer in the list of correct answers for an
input question. During evaluation, it is compared
to each answer in the set of correct answers, and
we take the maximum score among them. Table 4
lists the SQuAD-based EM and F1 metrics on the
test set. We see an improvement in memorization
for the Uniform and Temporal models, with the
latter doing slightly better on the Large and XXL
model sizes.

3.2 Better Calibration in the Future

We examine the model’s performance on future
slices of data at two different time scales. In the
first, we look at graceful degradation, mimicking
the life-cycle of a model that has been deployed,
and thus has not seen the newest slices of data
yet. In the second, we ask the models to predict
relations in the more distant future. While this
may seem unreasonable, it is possible to articulate
coherent intuitions about the future: for example,
the capitals of U.S. states change far less fre-
quently than their governors, and the probabilities
emitted by language models should reflect this.

3.2.1 Graceful Degradation

Here we examine the TEMPLAMA and CUSTOM-
NEWS performance on the 2019–20 slices. Note
that none of the models were pretrained or adapted
to this slice, so these experiments allow us to
measure degradation. We additionally look at the


perplexity of the masked LM, which we com-
pute as:

    ppl = exp( − Σ_{(x,y,t)} log P(y|x, t; θ) / Σ_y len(y) ).

Following Lazaridou et al. (2021), we expect per-
plexity to increase for slices that are not covered
in the training data, but we expect the temporally
conditioned model to be relatively more robust.
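
A sketch of how this quantity could be computed from per-example target log-likelihoods is shown below; the scoring function and the whitespace proxy for target length are assumptions, not the evaluation code used in the paper.

```python
import math

def masked_lm_perplexity(examples, log_prob_fn):
    """examples: iterable of (x, y, t); log_prob_fn(x, y, t) returns log P(y | x, t).

    Implements ppl = exp(-sum log P(y|x,t) / sum len(y)) from the definition above.
    """
    total_log_prob, total_tokens = 0.0, 0
    for x, y, t in examples:
        total_log_prob += log_prob_fn(x, y, t)
        total_tokens += len(y.split())  # rough proxy for the tokenized target length
    return math.exp(-total_log_prob / total_tokens)
```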

Results Comparing the Uniform and Tempo-
ral models in Table 2, we can see that training
with temporal context improves F1 scores on the
2019–20 slices. The Yearly ensemble, which uses
the latest 2018 model when tested on 2019–20,
is significantly worse on CUSTOMNEWS but com-
parable on TEMPLAMA; potentially because some
of the answers remain the same. A closer look
at the model predictions reveals that, unsurpris-
ingly, none of the models are able to predict the
TEMPLAMA facts that change after the training
period. Adding temporal context simply allows
the Temporal model to persist the unchanged facts
to 2019–20. On CUSTOMNEWS it has higher per-
formance on the SSM objective, which includes
both dates and entities in articles from an unseen
time period.

Table 7 shows MLM perplexity on the CUSTOM-
NEWS test set. The Temporal model has lowest
perplexities on both the seen and unseen slices
of evaluation data. The Uniform model has lower
perplexity than the Yearly one, especially on the
future slices where we use the 2018 expert for the
latter. This suggests that, for language modeling,
training on more data outweighs the benefit of
training on the specific temporal distribution of
test data.

Do the models learn how soon an answer is
likely to change in the future? We do a qual-
itative analysis by partitioning the TEMPLAMA
test queries where each model was correct in the
2018 evaluation into two sets: those with Single or
Multiple answers across 2010–20. Then we mea-
sure the log-likelihood of that correct answer as
we change the input year t from 2019 to 2029,
and plot the change in log-likelihood relative to
2018 in Figure 3. For the T5-CBQA-ft and Uni-
form models, we vary the input years by prefixing
queries with ‘‘In year,…’’. The confidence for all
models decreases as we get into the future, which
is reasonable since all relations in TEMPLAMA


Input                                                 Year   Uniform                           Temporal

X is the chair of Federal Reserve System.             2019   Janet L. Yellen                   Jerome Powell
Nigel Farage is a member of the X .                   2019   UK Independence Party             Brexit Party
Mark Sanford holds the position of X .                2017   Governor of South Carolina        United States representative
X is the head of the government of New York City.     2016   Michael Bloomberg                 Bill de Blasio
X is the head coach of Real Madrid CF.                2015   Zinedine Zidane                   Carlo Ancelotti
Theresa May holds the position of X .                 2014   Prime Minister of Great Britain   Home Secretary
Peyton Manning plays for X .                          2014   Indianapolis Colts                Denver Broncos
X is the head of the government of United Kingdom.    2011   Theresa May                       David Cameron
Marissa Mayer works for X .                           2011   Yahoo                             Google
Rahm Emanuel holds the position of X .                2010   Mayor of Chicago                  White House Chief of Staff

Table 5: Examples comparing the Uniform and Temporal models on TEMPLAMA. The former frequently predicts a more common or newsworthy answer from the range of the training data, without taking the year into account.

Model       2004–09        2010–18        2019–20

Uniform     34.8 (+6.3)    29.8 (–0.8)    27.4 (–0.4)
Temporal    36.3 (+5.2)    31.1 (–1.0)    28.8 (–0.7)

Table 6: F1 scores on different evaluation slices of CUSTOMNEWS for models trained on data from 2004–18. Numbers in parentheses show the absolute difference from the same model trained on data from 2010–18.

Model       2010–18   2019–20

T5-CBQA     26.11     29.22
Uniform     11.68     14.37
Yearly      13.62     23.30
Temporal    11.33     13.58

Table 7: Masked language modeling perplexity on CUSTOMNEWS (lower is better). The Temporal model degrades less when evaluated on the future time slice.

are time-sensitive. However, the confidence of
the Temporal model decreases more rapidly for
queries with multiple answers, reflecting the in-
tuition that facts which have changed in the past
are likely to change again in the future.

3.2.2 Future Relations

To further probe the models’ understanding of
expected versus unexpected changes in the future,
we curate a small diagnostic dataset of queries
about future relations. We restrict the queries such
that the answer is always either one of the 200
largest US cities or one of the 249 countries in
the world. This allows us to compute the entropy
of the predictions over a fixed set. To relate model


Figure 3: Change in log-likelihood over time of the
most recent answer (as of 2018) for TEMPLAMA
queries with Single or Multiple answers. The difference
is taken from the value for the 2018 answer. The Tem-
poral model exhibits a more pronounced confidence
gap for facts that changed in the past.

predictions to commonsense intuitions, we con-
struct three sets of queries based on how frequently
they are expected to change: frequent, rare, and
never. For example, the location of an awards
show might change every year, while the city
an athlete plays in changes every few years, and
the location of a landmark almost never changes.
Then, given queries like ''In 2022, the Space
Needle will be in X ’’ and ‘‘In 2022, the NBA
All-Star Game will be in X .’’, a model with
a reasonable representation of time should have
lower entropy for the former rather than the lat-
ter. Moreover, the entropy should increase with
time as the queries address the more distant fu-
ture, and the rate of increase should be greatest
for frequently-changing relations. Note that we do
not expect models to provide the correct answers


note the limitations with this evaluation, however:
(1) due to manual curation by the authors there
are only 86 queries in these sets, and they are likely to
be biased in the facts they probe; and (2) entropy
mixes different kinds of uncertainty: that which is
inherent in the query (e.g., there are more distinct
countries than cities with NFL teams), as well as
that due to the lack of confidence in the model.
We are interested in the latter, but our evaluation
does not disentangle the two effects.

3.3 Cheaper Adaptation to New Data

Improved calibration about the future can help
minimize mistakes after the training time period
(e.g., by abstaining), but eventually models need
to be refreshed as the world changes and new data
arrives. In this section, we consider the setting
where we have an already trained model on the
2010–18 slices, as well as new data from the 2019
slice. We attempt to update the model on this new
data (as measured by the combined performance
on 2019–20 held out data) without forgetting the
2010–18 slices. These experiments are similar
to the task posed by Lazaridou et al. (2020),
but we compare the impact of adapting versus
retraining from scratch. Finetuning only on the
newest data (2019) is suboptimal as the model
forgets facts about the past (Figure 5), which
was also observed by Zhu et al. (2020). Here
we explore a simple alternative—training on a
mixture which samples a data point from the new
slice (2019) with probability α and a data point
from the old slices (2010–18) with probability
1 − α. We finetune both the Temporal and Uniform
models on this mixture for an additional 50K steps
and compare the resulting performance to models
retrained from scratch for 300K steps on data
sampled uniformly from all slices (2010–19). Note
that the latter strategy can be costly for large-scale
LMs (Strubell et al., 2019).
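
The adaptation mixture can be read as a two-way sampler over the old and new slices; the generator below is a sketch of that sampling scheme, not the training pipeline itself.

```python
import random

def adaptation_mixture(old_examples, new_examples, alpha, seed=0):
    """Yield finetuning examples: the new (2019) slice with probability alpha,
    the pooled old (2010-18) slices with probability 1 - alpha."""
    rng = random.Random(seed)
    while True:
        pool = new_examples if rng.random() < alpha else old_examples
        yield rng.choice(pool)
```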

Results Figure 5 shows the F1-score on CUSTOM-
NEWS and TEMPLAMA as we vary α. Across all
values of α, the Uniform model improves sig-
nificantly on the 2019 slice, but this comes at
the cost of degrading on the 2010–18 slices.
The Temporal model also adapts to 2019, but
shows minimal degradation on the 2010–18
slice up to α = 0.6. For α = 0.5 we found
that its performance with 10K additional steps
matches that of the Temporal model trained from
scratch for 300K steps, suggesting that models

Figure 4: Entropy over time for frequent, rare, and
never-changing queries. The Temporal model is more
uncertain about frequently changing queries as time
passes, and has a flatter entropy for constant facts.

for these queries (which we do not know anyway),
but only assign confidence in a manner consistent
with human intuitions. In total, we constructed 86
queries across the three sets, which are included
in Appendix B.
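
The entropy reported in Figure 4 is taken over a closed candidate set (the 200 largest US cities or the 249 countries); a sketch of that computation, assuming the model exposes a function that scores a candidate answer string for a query:

```python
import math

def answer_entropy(query, candidates, log_prob_fn):
    """Entropy of the model's distribution over a fixed candidate answer set.

    log_prob_fn(query, answer) returns an (unnormalized) log-score; the scores
    are renormalized over the candidate set before the entropy is computed.
    """
    scores = [log_prob_fn(query, c) for c in candidates]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]
    return -sum(p * math.log(p) for p in probs if p > 0)
```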

Results Figure 4 shows the entropy of differ-
ent model variants averaged across the three sets
of queries and plotted over time. The baseline
T5-CBQA-ft model has a low constant entropy
throughout, irrespective of the query type. Com-
bined with its low accuracy on future slices from
Table 2, this suggests it remains confidently in-
correct and has poor calibration about which facts
are likely to change. Both the Uniform and Tem-
poral models have increasing uncertainty in the
future, which is ordered correctly according to
intuition: highest for the queries of frequently
changing facts, and lowest for queries whose an-
swers are expected not to change. Interestingly,
the Temporal model has a largely constant en-
tropy for rare- and never-changing queries until
2022, after which it begins to increase. While this
agrees with intuition, ideally a model should have
low entropy on the never-changing set further into
the future.

Overall, these results suggest that: (1) mod-
els trained uniformly over a wide range of time-
sensitive data show improved calibration about
expected changes in the future; and (2) training
with temporal context further improves this cali-
bration for the first few years beyond the training
period, in our case from 2019 to 2022. We also

Figure 5: CUSTOMNEWS (left) and TEMPLAMA (right) F1 score as models are adapted to new data from 2019 for
50K steps. α denotes the fraction of training examples which come from the 2019 slice (remaining examples come
from the 2010–18 slices). Dotted lines indicate models retrained from scratch for 300K steps on equal proportions
of all data from 2010–19. The Temporal model degrades less than Uniform on the 2010–18 slice when adapted.

trained with temporal context can be efficiently
adapted to new data without forgetting facts from
the old data.

4 Discussion and Limitations

Our experiments have shown that current models
have practical limitations in their ability to mem-
orize the past and reasonably estimate the future.
These limitations can be mitigated by providing
the model the date at which a text was created.
While our results show consistent advantages, they
also represent a narrow understanding of time. In
particular, the publication date of a news article
does not necessarily correspond to the temporal
scope of all events described in the article. For
example, articles may talk about historical events
or discuss events scheduled to happen in the future.
In CUSTOMNEWS around 3.9% of sentences explicitly
mention a year between 2010 and 2018, and 2.1%
mention the same year as the publication date of
the article. This fraction is likely responsible for
the improvement of the Uniform model. The Tem-
poral model further assigns an approximate scope
to the remaining 96% of sentences, and it is encour-
aging to see improvements from that. One avenue
for future work is to explore better strategies for
assigning dates to these sentences.

We have focused on closed-book question an-
swering, but temporal staleness of language mod-
els may have impacts in other applications as well.
For example, in open-book question answering, it

is still necessary to align the question with rele-
vant text in the retrieved passage, and this could be
challenging when the question cannot be properly
encoded by a stale LM: for example, the query
‘‘which countries were affected by the 2020 hurri-
cane season?’’ would not match the passage ‘‘Iota
caused damages of $564 million in Nicaragua’’ in
an LM that did not have access to training data
mentioning ‘‘Iota’’ as a hurricane.

Another limitation of our work is that TEMP-
LAMA is constructed in a synthetic manner from
WikiData. Incomplete or incorrect facts in the KB
can result in incorrect queries in TEMPLAMA; for
instance, we assume a missing start date implies
the fact is valid from the beginning of our time
period of interest. We partition the TEMPLAMA
and CUSTOMNEWS dataset on the same yearly slices
despite the nature of the datasets being quite
different. Moreover, we did not investigate using
longer or shorter temporal partitions. In addition,
we did not test the ability to model temporal
expressions such as ''before'' or ''during'', and
we did not investigate temporal commonsense
(e.g., Zhou et al. 2019), temporal ordering (e.g.,
Ning et al. 2020), or events (e.g., Zhou et al. 2021).
Lastly, it is worth noting that like all closed-
book models the models presented in this paper are
also likely to only memorize common facts about
popular entities. This has the danger of reinforc-
ing stereotypes and leading to unfair outcomes.
In addition, training the multitude of large-scale
language models presented in this paper required
the use of 32 Cloud TPU v3 cores for several


hundred hours, which has a significant environ-
mental impact (Strubell et al., 2019). However,
our hope is that efficient schemes for updating
temporally sensitive knowledge in LMs will even-
tually save energy costs in the long run.

5 Related Work

There is extensive prior work on learning di-
achronic embeddings of individual words (e.g.,
Wijaya and Yeniterzi, 2011; Hamilton et al., 2016;
Bamler and Mandt, 2017). Particularly related is
the approach of Dubossarsky et al. (2019), who
learn time-sensitive embeddings by concatenat-
ing each word token with the decade in which
it appears. As contextualized embedding models
have largely replaced non-contextual word em-
beddings (Peters et al., 2018; Devlin et al., 2019),
the main application of diachronic word embed-
dings is to detect and model lexical semantic
changes (e.g., Frermann and Lapata, 2016), rather
than to improve temporal awareness on down-
stream tasks. Our work fills this gap by adding a
temporal component to T5, a pretrained language
model that can complete multi-token spans. While
Giulianelli et al. (2020) use contextualized em-
beddings from BERT to model lexical semantic
changes post hoc, they do not add a time-sensitive
component to the language model itself. Thus,
their approach cannot support time-aware fact
completion.

Several studies have focused on degradation of
models on test data from a different time period
than their training data (Huang and Paul, 2018,
2019; Jaidka et al., 2018; Lukes and Søgaard,
2018; Florio et al., 2020). Delasalles et al. (2019)
introduced an LSTM language model that con-
ditions on dynamic author representations com-
puted separately, and showed that it improves
perplexity on both seen and unseen (future) time
periods. Most recently, Röttger and Pierrehumbert
(2021) analyzed the interplay between temporal
adaptation during pretraining and finetuning, and
concluded that while both stages benefit from
adaptation separately, adaptation during pretrain-
ing does not help the downstream task. Here
we show that the benefits of adaptation can be
achieved using a single model that conditions
on time. We further show that the benefits of
adaptation come, at least in part, from better
memorization of time-sensitive facts.

In production contexts, an important form of
temporal generalization is the deployment of
models trained on data up to a certain time T
but applied on data after T: that is, the present.
Lazaridou et al. (2021) show that language mod-
els gradually degrade in performance under such a
time-stratified setting, and propose dynamic eval-
uation (Krause et al., 2018) as a potential mitiga-
tion. However, LMs are frequently applied to past
data as well, for example, for extracting represen-
tations, and here we show that updating on only
the new data degrades performance on old data.
Our approach of conditioning on the temporal
context alleviates this issue.

A related line of work has explored editing
neural predictions after training given a dataset
of revised input and output pairs (Sinitsin et al.,
2020; Zhu et al., 2020; De Cao et al., 2021).
Here we introduce a different setting where we
have access to new unlabeled text after model
entraînement, which must be used implicitly to update
the factual predictions of the model. In this case the
update procedure also needs to figure out which
facts must be updated and which ones remain
the same.

Petroni et al. (2019) introduced the LAMA
benchmark for probing the factual knowledge
memorized by LMs, which consists of cloze
queries about facts, for example, ''Dante was
born in X ’’. Follow up studies have introduced
improved prompts for eliciting such knowledge
(Jiang et al., 2020b) as well as multilingual ver-
sions (Jiang et al., 2020a; Kassner et al., 2021).
However, all these benchmarks assume a static
view of the knowledge inside an LM, and con-
sider all answers across time to be correct for a
given query. The TEMPLAMA dataset instead fo-
cuses on relations where the answers change with
time and uses temporal scopes to determine the
correct answer.

TEMPLAMA is similar in spirit to KB-QA
benchmarks which focus on temporal reasoning
such as TempQuestions (Jia et al., 2018) and
CronQuestions (Saxena et al., 2021). Its format,
however, mimics the masked LM task typically
used in pretraining, since it is intended as a
zero/few-shot probe. Unlike those datasets, we
further restrict the queries to subject and relation
pairs for which multiple objects exist at different
points in time, and ensure a balanced distribu-
tion over the entire time period of interest from
2010–2020.


WikiData ID   Relation                # Queries   Template

P54           member of sports team   9033        <subject> plays for <object>.
P39           position held           7343        <subject> holds the position of <object>.
P108          employer                9049        <subject> works for <object>.
P102          political party         7324        <subject> is a member of the <object>.
P286          head coach              4886        <subject> is the head coach of <object>.
P69           educated at             1672        <subject> attended <object>.
P488          chairperson             4190        <subject> is the chair of <object>.
P6            head of government      4125        <subject> is the head of the government of <object>.
P127          owned by                2688        <subject> is owned by <object>.

Table 8: Templates used for converting WikiData facts into natural language queries.

6 Conclusion

Though temporally scoped facts are common in
practice, there has been little prior work explor-
ing how these are encoded in pretrained LMs.
We show that T5 does poorly on such facts and
training on the news domain improves it signifi-
cantly. However, simply training on more data is
sub-optimal; conditioning on the temporal context
of the data improves memorization of facts fur-
ther. Thus, we propose a time-aware language
model which conditions on string prefixes of time.
Other benefits of time-aware LMs include a bet-
ter calibration of expected changes in the future,
and a cheaper adaptation to new slices of time-
stamped data.

Acknowledgments

We would like to thank the Action Editor and
Reviewers for comments on an earlier draft of this

work, and the T5X team at Google for their T5
implementation.

Supplementary Material

A TEMPLAMA Templates

Table 8 lists the 9 WikiData relations used for
constructing TEMPLAMA. We instantiate the tem-
plate for the relation in each fact by replacing
''<subject>'' with the name of the subject entity,
and ''<object>'' with '' X ''. The answer to the
query is the name of the corresponding object en-
tity. We construct a separate query for each year
that the fact is valid.
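
A minimal instantiation sketch (illustrative helper only; the paper renders the object slot as '' X ''):

```python
def instantiate(template, subject, mask="X"):
    """Fill a relation template with the subject name; the object slot becomes the mask."""
    return template.replace("<subject>", subject).replace("<object>", mask)

instantiate("<subject> plays for <object>.", "Cristiano Ronaldo")
# -> 'Cristiano Ronaldo plays for X.'
```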

B Future Relations

Table 9 shows the queries used as part of the Fu-
ture Relations experiment in § 3.2. These queries
were constructed by searching for lists of events,
popular athletes, and issuing targeted queries to
the WikiData Query Service.


Frequent (US city answers):

The Super Bowl will take place in X .
The NCAA Men’s Final Four will take place in X .
The first game of the World Series will take place in X .
The US PGA Championship will take place in X .
The golf US Open will take place in X .
The NBA all-star game will take place in X .
The NFL Draft will take place in X .
The Netroots Nation conference will take place in X .
The MLB all-star game will take place in X .
The team from X won the NBA championship.
The team from X won the Stanley Cup.
The team from X won the World Series.
The team from X won the Super Bowl.
The golf US Women’s Open will take place in X .
Wrestlemania will take place in X .


Rare (US city answers):

Visa Inc.’s headquarters are located in X .
SEGA of America’s headquarters are located in X .
Barack Obama lives in X .
Hillary Clinton lives in X .
Donald Trump works in X .
The Chargers play their home games in X .
The Raiders play their home games in X .
The Rams play their home games in X .
General Electric’s headquarters are located in X .
Toyota’s US headquarters are located in X .
Nestle’s headquarters are located in X .
Tesla’s headquarters are located in X .
Lebron James plays in X .
Tom Brady plays in X .
Kevin Durant plays in X .
Stephen Curry plays in X .
Sidney Crosby plays in X .
Mike Trout plays in X .
The Democratic National Convention will next take place in X .
The Republican National Convention will next take place in X .

Never (US city answers):

South by Southwest will take place in X .
Lollapalooza will take place in X .
Summerfest will take place in X .
Outside Lands will take place in X .
Spoleto Festival USA will take place in X .
CMA Music Festival will take place in X .
Made in America Festival will take place in X .
The US Open Tennis Championships will take place in X .
The Masters tournament will take place in X .
The Kentucky Derby will take place in X .
The capital of Washington state is X .
The capital of California state is X .
The capital of Texas is X .
The capital of Florida is X .
The Space Needle is located in X .
The Statue of Liberty is located in X .
Golden Gate Bridge is located in X .
The White House is located in X .
The Liberty Bell is located in X .

Frequent (country answers):

The Six Nations Championship will be held in X .
The Association for Computational Linguistics will meet in X .
The Neural Information Processing Systems conference will be held in X .
The Palme d’Or winner is from X .
The Tour De France winner is from X .
The Wimbledon Men’s Singles winner is from X .
The UEFA Champions League final will take place in X .
The G20 summit will be held in X .
The G7 summit will be held in X .
The United Nations Climate Change conference will take place in X .

Rare (country answers):

The UN Secretary general is from X .
The Pope hails from X .
The FIFA world cup was last held in X .
The Cricket world cup was last held in X .
The UEFA European Football Championship was last held in X .
The Olympics were last held in X .
The Winter Olympics were last held in X .
The FIFA world cup was last won by X .
The Cricket world cup was last won by X .
X won the most gold medals in the last Olympics.

Never (country answers):

The Oxford Literary Festival will take place in X .
Wimbledon will take place in X .
Tomorrowland will take place in X .
Hajj will take place in X .
The Eiffel Tower is located in X .
The Taj Mahal is located in X .
Burj Khalifa is located in X .
Machu Picchu is located in X .
Stonehenge is located in X .
The world’s largest country by land area is X .
The world’s longest river is in X .
The world’s tallest mountain is in X .

Table 9: The Future Relations dataset used to test model calibration over future years. The three groups represent queries whose answers, intuitively, change frequently or every year, rarely or once every few years, and never. The top section includes queries whose answer is a US city, while the bottom section includes queries whose answer is a country.


References

Daniel Adiwardana, Minh-Thang Luong, David
R.. Donc,
Jamie Hall, Noah Fiedel, Romal
Thoppilan, Zi Yang, Apoorv Kulshreshtha,
Gaurav Nemade, Yifeng Lu, and Quoc V.
Le. 2020. Towards a human-like open-domain
chatbot. CoRR, abs/2001.09977.

Robert Bamler and Stephan Mandt. 2017. Dyna-
mic word embeddings. In International Con-
ference on Machine Learning, pages 380–389.
PMLR.

Nicola De Cao, Wilker Aziz, and Ivan Titov.
2021. Editing factual knowledge in language
models. In Proceedings of the 2021 Conference
on Empirical Methods in Natural Language
Processing, pages 6491–6506, Online and
Punta Cana, Dominican Republic. Association
for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.522

Edouard Delasalles, Sylvain Lamprier, and
Ludovic Denoyer. 2019. Learning dynamic
author representations with temporal language
models. In 2019 IEEE International Confer-
ence on Data Mining (ICDM), pages 120–129.
https://doi.org/10.1109/ICDM.2019
.00022

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.

Haim Dubossarsky, Simon Hengchen, Nina
Tahmasebi, and Dominik Schlechtweg. 2019.
Time-out: Temporal referencing for robust
modeling of lexical semantic change. In Pro-
ceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 457–470, Florence, Italy. Association for
Computational Linguistics. https://doi.org/10.18653/v1/P19-1044


Lea Frermann and Mirella Lapata. 2016. A
Bayesian model of diachronic meaning change.
Transactions of the Association for Compu-
tational Linguistics, 4:31–45. https://doi.org/10.1162/tacl_a_00081

Mario Giulianelli, Marco Del Tredici, and Raquel
Fernández. 2020. Analysing lexical semantic
change with contextualised word represen-
tations. In Proceedings of
the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 3960–3973, Online. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.acl-main.365

Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Mingwei Chang. 2020. Retrieval
augmented language model pre-training. In
Proceedings of the 37th International Confer-
ence on Machine Learning, volume 119 of
Proceedings of Machine Learning Research,
pages 3929–3938. PMLR.

William L. Hamilton, Jure Leskovec, and Dan
Jurafsky. 2016. Diachronic word embeddings
reveal statistical laws of semantic change. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1489–1501,
Berlin, Germany. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P16-1141

Spurthi Amba Hombaiah, Tao Chen, Mingyang
Zhang, Mike Bendersky, and Marc Najork.
2021. Dynamic language models for contin-
uously evolving content. In Knowledge Dis-
covery and Data Mining (KDD). https://
doi.org/10.1145/3447548.3467162

Xiaolei Huang and Michael J. Paul. 2018. Exam-
ining temporality in document classification.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 2: Short Papers), pages 694–699,
Melbourne, Australia. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-2110

Xiaolei Huang and Michael J. Paul. 2019. Neural
temporality adaptation for document classi-
fication: Diachronic word embeddings and
domain adaptation models. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 4113–4123,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1403

Robert L. Logan IV, Ivana Balazevic, Eric
Wallace, Fabio Petroni, Sameer Singh, and
Sebastian Riedel. 2021. Cutting down on
prompts and parameters: Simple few-shot learn-
ing with language models. CoRR, abs/2106
.13353.

Kokil Jaidka, Niyati Chhaya, and Lyle Ungar.
2018. Diachronic degradation of language mod-
le: Insights from social media. In Proceedings
of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2:
Short Papers), pages 195–200, Melbourne,
Australia. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/P18-2032

Zhen Jia, Abdalghani Abujabal, Rishiraj Saha
Roy, Jannik Strötgen, and Gerhard Weikum.
2018. Tempquestions: A benchmark for tem-
poral question answering. In Companion Pro-
ceedings of the The Web Conference 2018,
pages 1057–1062. https://doi.org/10
.1145/3184558.3191536

Zhengbao Jiang, Antonios Anastasopoulos, Jun
Araki, Haibo Ding, and Graham Neubig.
2020un. X-FACTR: Multilingual factual knowl-
edge retrieval from pretrained language models.
In Proceedings of
the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 5943–5959, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.emnlp-main.479

Zhengbao Jiang, Frank F. Xu, Jun Araki, and
Graham Neubig. 2020b. How can we know
what language models know? Transactions of
the Association for Computational Linguistics,
8:423–438. https://doi.org/10.1162
/tacl_a_00324

Nora Kassner, Philipp Dufter, and Hinrich
Schütze. 2021. Multilingual LAMA: Investi-
gating knowledge in multilingual pretrained
language models. In Proceedings of the 16th
Conference of the European Chapter of the
Association for Computational Linguistics:
Main Volume, pages 3250–3258, Online.

Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.eacl-main.284

Ben Krause, Emmanuel Kahembwe, Iain Murray,
and Steve Renals. 2018. Dynamic evaluation
of neural sequence models. In Proceedings of
the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Ma-
chine Learning Research, pages 2766–2775.
PMLR.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:452–466. https://doi
.org/10.1162/tacl_a_00276

Angeliki Lazaridou, Adhi Kuncoro, Elena
Gribovskaya, Devang Agrawal, Adam Liska,
Tayfun Terzi, Mai Gimenez, Cyprien de
Masson d’Autume, Tomas Kocisky, Sebastian
Ruder, et autres. 2021. Mind the gap: Assess-
ing temporal generalization in neural language
models. Advances in Neural Information Pro-
cessing Systems, 34.

Konstantina Lazaridou, Alexander Löser, Maria
Mestre, and Felix Naumann. 2020. Discovering
biased news articles leveraging multiple human
annotations. In Proceedings of the 12th Lan-
guage Resources and Evaluation Conference,
pages 1268–1277, Marseille, France. European
Language Resources Association.

Nayeon Lee, Belinda Z. Li, Sinong Wang,
Wen-tau Yih, Hao Ma, and Madian Khabsa.
2020. Language models as fact checkers?
In Proceedings of
the Third Workshop on
Fact Extraction and VERification (FEVER),
pages 36–41, Online. Association for Compu-
tational Linguistics.

Patrick Lewis, Pontus Stenetorp, and Sebastian
Riedel. 2021. Question and answer test-train
overlap in open-domain question answering
datasets. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 1000–1008, Online. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/2021.eacl-main.86

Jialu Liu, Tianqi Liu, and Cong Yu. 2021.
Newsembed: Modeling news through pre-
trained document representations. In Proceed-
ings of the 27th ACM SIGKDD Conference
on Knowledge Discovery & Data Mining,
KDD ’21, pages 1076–1086, New York, NY,
USA. Association for Computing Machinery.
https://doi.org/10.1145/3447548
.3467392

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. CoRR,
abs/1907.11692.

Jan Lukes and Anders Søgaard. 2018. Sentiment
analysis under temporal shift. In Proceed-
ings of
the 9th Workshop on Computa-
tional Approaches to Subjectivity, Sentiment
and Social Media Analysis, pages 65–71,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/W18-6210

Qiang Ning, Hao Wu, Rujun Han, Nanyun
Peng, Matt Gardner, and Dan Roth. 2020.
TORQUE: A reading comprehension dataset
of temporal ordering questions. In Proceedings
of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 1158–1172, Online. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/2020.emnlp-main.88

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
le 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237,
New Orleans, Louisiana. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N18-1202

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander Miller. 2019. Language mod-
els as knowledge bases? In Proceedings of
the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 2463–2473.

Alec Radford, Jeff Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. In Proceedings
of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 784–789, Melbourne, Australia.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P18
-2124

Michael Ringgaard, Rahul Gupta, and Fernando
C. N. Pereira. 2017. SLING: A framework for
frame semantic parsing. CoRR, abs/1710.07032.

Adam Roberts, Colin Raffel, and Noam Shazeer.
2020. How much knowledge can you pack
into the parameters of a language model?
In Proceedings of
the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 5418–5426, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.emnlp-main.437

Paul Röttger and Janet Pierrehumbert. 2021. Tem-
poral adaptation of BERT and performance
on downstream document classification: In-
sights from social media. In Findings of the
Association for Computational Linguistics:
EMNLP 2021, pages 2400–2412, Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.findings-emnlp.206

Apoorv Saxena, Soumen Chakrabarti, and Partha
Talukdar. 2021. Question answering over
temporal knowledge graphs. In Proceedings
of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th
International Joint Conference on Natural
Language Processing (Volume 1: Long Pa-
pers), pages 6663–6676, Online. Association
for Computational Linguistics.

Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitry
Pyrkin, Sergei Popov, and Artem Babenko.
2020. Editable neural networks. In International
Conference on Learning Representations.

Emma Strubell, Ananya Ganesh, and Andrew
McCallum. 2019. Energy and policy consid-
erations for deep learning in NLP. In Pro-
ceedings of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 3645–3650, Florence, Italy. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/P19-1355

Erik F. Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity re-
cognition. In Proceedings of the Seventh Con-
ference on Natural Language Learning at
HLT-NAACL 2003, pages 142–147. https://
doi.org/10.3115/1119176.1119195

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in Neural
Information Processing Systems, volume 30.
Curran Associates, Inc.

Denny Vrandečić and Markus Krötzsch. 2014.
Wikidata: A free collaborative knowledgebase.

Commun. ACM, 57(10):78–85. https://doi
.org/10.1145/2629489

Derry Tanti Wijaya and Reyyan Yeniterzi. 2011.
Understanding semantic change of words over
centuries. In Proceedings of the 2011 Interna-
tional Workshop on DETecting and Exploit-
ing Cultural DiversiTy on the Social Web,
DETECT ’11, pages 35–40, New York, NY,
USA. Association for Computing Machinery.
https://doi.org/10.1145/2064448
.2064475

Ben Zhou, Daniel Khashabi, Qiang Ning, and
Dan Roth. 2019. ‘‘Going on a vacation’’ takes
longer than ‘‘going for a walk’’: A study of
temporal commonsense understanding. In Pro-
ceedings of the 2019 Conference on Empiri-
cal Methods in Natural Language Processing
and the 9th International Joint Conference
on Natural Language Processing (EMNLP-
IJCNLP), pages 3363–3369, Hong Kong,
China. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D19-1332

Ben Zhou, Kyle Richardson, Qiang Ning, Tushar
Khot, Ashish Sabharwal, and Dan Roth.
2021. Temporal reasoning on implicit events
from distant supervision. In Proceedings of
the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 1361–1371, Online. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/2021.naacl-main.107

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer,
Srinadh Bhojanapalli, Daliang Li, Felix X. Yu,
and Sanjiv Kumar. 2020. Modifying memories
in transformer models. CoRR, abs/2012.00363.
