Book Reviews
Embeddings in Natural Language Processing: Theory and Advances
in Vector Representations of Meaning
Mohammad Taher Pilehvar and Jose Camacho-Collados
(Tehran Institute for Advanced Studies & Cardiff University)
Morgan & Claypool (Synthesis Lectures on Human Language Technologies, edited by Graeme Hirst, volume 47), 2021, xvii+157 pp; paperback, ISBN 978-1-63639-021-5; ebook, ISBN 978-1-63639-022-2; hardcover, ISBN 978-1-63639-023-9; doi:10.2200/S01057ED1V01Y202009HLT047
Reviewed by
Marcos Garcia
CiTIUS, University of Santiago de Compostela
Word vector representations have a long tradition in several research fields, such as cognitive science and computational linguistics. They have been used to represent the meaning of various units of natural language, including, among others, words, phrases, and sentences. Before the deep learning tsunami, count-based vector space models had been successfully used in computational linguistics to represent the semantics of natural language. However, the rise of neural networks in NLP popularized the use of word embeddings, which are now applied as pre-trained vectors in most machine learning architectures.
This book, written by Mohammad Taher Pilehvar and Jose Camacho-Collados, provides a comprehensive and easy-to-read review of the theory and advances in vector models for NLP, focusing especially on semantic representations and their applications. It is a great introduction to different types of embeddings and the background and motivations behind them. In this sense, the authors adequately present the most relevant concepts and approaches that have been used to build vector representations. They also keep track of the most recent advances in this vibrant and fast-evolving area of research, discussing cross-lingual representations and current language models based on the Transformer. Therefore, this is a useful book for researchers interested in computational methods for semantic representations and artificial intelligence. Although some basic knowledge of machine learning may be necessary to follow a few topics, the book includes clear illustrations and explanations, which make it accessible to a wide range of readers.
Apart from the preface and the conclusions, the book is organized into eight chapters. In the first two, the authors introduce some of the core ideas of NLP and artificial neural networks, respectively, discussing several concepts that will be useful throughout the book. Then, Chapters 3 to 6 present different types of vector representations at the lexical level (word embeddings, graph embeddings, sense embeddings, and contextualized embeddings), followed by a brief chapter (7) about sentence and document embeddings. For each specific topic, the book includes methods and data sets to assess the quality of the embeddings. Finally, Chapter 8 raises ethical issues involved
in data-driven models for artificial intelligence. Each chapter can be summarized as follows.
Chapter 1 provides a brief introduction to some challenges of NLP, from both the understanding and the generation perspectives, including different types of linguistic ambiguity. The main part of the chapter introduces vector space models for semantic representation, presenting the distributional hypothesis and the evolution of vector space models.
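As a minimal illustration of the distributional hypothesis (a sketch of mine, not an example from the book), the snippet below builds a tiny word-by-word co-occurrence matrix and compares words by cosine similarity; the toy corpus and window size are arbitrary assumptions.

```python
import numpy as np

# Toy corpus; real vector space models are built from large text collections.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count co-occurrences within a symmetric window of two words.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - 2), min(len(sentence), i + 3)):
            if i != j:
                counts[index[word], index[sentence[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words used in similar contexts ("cat" and "dog") get similar count vectors.
print(cosine(counts[index["cat"]], counts[index["dog"]]))
```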
The second chapter starts with a quick introduction to some linguistic fundamentals for NLP (syntax, morphology, and semantics) and to statistical language models. Then, it gives an overview of deep learning, presenting the fundamental differences between architectures, as well as concepts that will be referred to throughout the book. Finally, the authors present some of the most relevant knowledge resources for building semantically richer vector representations.
Chapter 3 is an extensive review of word embeddings. It first presents different count-based approaches and dimensionality reduction techniques, and then discusses predictive models such as Word2vec and GloVe. Furthermore, it describes character-based and knowledge-based embeddings, as well as supervised and unsupervised approaches to cross-lingual vector representations.
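To make the predictive approach concrete, here is a minimal sketch (not code from the book) that trains a small Word2vec model with the gensim library and queries its nearest neighbors; the toy sentences and hyperparameters are illustrative assumptions only.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real models are trained on very large corpora.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "chased", "the", "dog"],
]

# Skip-gram (sg=1) with a small window and vector size, for illustration.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Nearest neighbors of "cat" in the learned vector space.
print(model.wv.most_similar("cat", topn=3))
```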
Chapter 4 illustrates the principal methods for building node and relation embeddings from graphs. First, it presents the key strategies for building node embeddings, from matrix factorization and random walks to methods based on graph neural networks. Then, two approaches to relation embeddings are presented: those built from knowledge graphs, and unsupervised methods that exploit regularities in the vector space.
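A hedged sketch of the random-walk strategy, in the style of DeepWalk (my illustration, not the book's code): short random walks over a graph are treated as sentences and fed to Word2vec, so that nodes sharing contexts receive similar embeddings. The toy adjacency list is an assumption.

```python
import random
from gensim.models import Word2Vec

# Toy undirected graph as an adjacency list (hypothetical example).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"],
}

def random_walk(start, length=6):
    """Sample a short uniform random walk starting at the given node."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Walks play the role of sentences, and nodes the role of words.
walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("a", topn=2))
```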
The next chapter (5) starts by presenting the Meaning Conflation Deficiency of static word embeddings, which motivates research on sense representations. This chapter discusses two main approaches to building sense embeddings: unsupervised methods that induce senses from corpora, and knowledge-based approaches that take advantage of lexical resources.
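To see the Meaning Conflation Deficiency in practice, the following sketch (my own illustration, with pre-trained GloVe vectors fetched through gensim's downloader) shows that the single static vector of an ambiguous word such as "bank" sits close to words from both of its senses.

```python
import gensim.downloader as api

# Downloads small pre-trained GloVe vectors on first use (network required).
vectors = api.load("glove-wiki-gigaword-50")

# One static vector for "bank" conflates its financial and river senses:
print(vectors.similarity("bank", "money"))
print(vectors.similarity("bank", "river"))
```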
Chapter 6 addresses contextualized embeddings and describes the main properties of the Transformer architecture and the self-attention mechanism. It includes an overview of these types of embeddings, from early methods that represent a word by its context to current language models for contextualized word representation. In this respect, the authors present contextualized models based on recurrent neural networks (e.g., ELMo) and on the Transformer (GPT, BERT, and some derivatives). The potential impact of several parameters, such as subword tokenization or the training objective, is also explained, and the authors discuss various approaches to using these models in downstream tasks, such as feature extraction and fine-tuning. Finally, they summarize some interesting insights regarding the exploration of the linguistic properties encoded by neural language models.
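As an illustration of the feature-extraction approach (a sketch under the assumption of using the Hugging Face transformers library, not the book's own code), the snippet below obtains one contextualized vector per token from a pre-trained BERT model, so the same word receives different representations in different contexts.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" gets a different contextualized vector in each sentence.
for text in ["she sat by the river bank", "she deposited cash at the bank"]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One embedding per (subword) token: [batch, tokens, hidden size].
    print(text, "->", outputs.last_hidden_state.shape)
```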
Chapter 7 offers a brief sketch of vector representations of longer units, such as sentences and documents. It presents the bag-of-words approach and its limitations, as well as the concept of compositionality and its significance for the unsupervised learning of sentence embeddings. Some supervised strategies (e.g., training on natural language inference or machine translation data sets) are also discussed.
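A simple compositional baseline of the kind such chapters usually depart from (my hedged sketch, not necessarily the book's example) is to average the static vectors of a sentence's words:

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained static vectors

def sentence_embedding(sentence):
    """Average the vectors of in-vocabulary words: a crude compositional baseline."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

v1 = sentence_embedding("the cat sat on the mat")
v2 = sentence_embedding("a dog lay on the rug")
print(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity
```

Averaging discards word order entirely, which is precisely the bag-of-words limitation that motivates compositional and supervised sentence encoders.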
Ethical aspects and biases of word representations are the focus of Chapter 8. Here, the authors present some risks of data-driven models for artificial intelligence and use examples of gender stereotypes to show biases present in word embeddings, followed by several methods aimed at reducing those biases. Overall, the authors emphasize the growing interest within the NLP community in critically analyzing the social impact of these models.
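As a hedged illustration of the kind of diagnostic this chapter motivates (my sketch, not the book's code), gender associations in static embeddings can be probed by projecting occupation words onto the direction from "she" to "he":

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# A crude gender direction; published debiasing methods use many word pairs.
direction = vectors["he"] - vectors["she"]
direction /= np.linalg.norm(direction)

# Positive scores lean toward "he", negative toward "she".
for word in ["engineer", "nurse", "doctor", "homemaker"]:
    score = vectors[word] @ direction / np.linalg.norm(vectors[word])
    print(f"{word}: {score:+.3f}")
```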
The book concludes by highlighting some of the major achievements of current vector representations and calling for more rigorous evaluations to measure their progress, especially in languages other than English, and with an eye on interpretability.
In summary, this book offers a high-level synthesis of different types of embeddings for NLP, focused on the general concepts and the most established techniques, and it includes useful pointers for delving deeper into specific topics. As the book also discusses the most recent contextualized models (up to November 2020), the result is an attractive combination of the foundations of vector space models with current approaches based on artificial neural networks. As suggested by the authors, because of the explosion and rapid development of deep learning methods for NLP, maybe "it is necessary to step back and rethink in order to achieve true language understanding."
Marcos Garcia is a postdoctoral researcher at CiTIUS, the Research Center in Intelligent Technologies of the University of Santiago de Compostela. He has worked on NLP topics such as PoS-tagging, dependency parsing, and lexical semantics, and has developed resources and tools for different languages in both industry and academia. His e-mail address is marcos.garcia.gonzalez@usc.gal.