Effective Approaches to Neural Query
Language Identification
Xingzhang Ren
Alibaba DAMO Academy
xingzhang.rxz@alibaba-inc.com
Baosong Yang∗
Alibaba DAMO Academy
yangbaosong.ybs@alibaba-inc.com
Dayiheng Liu
Alibaba DAMO Academy
liudayiheng.ldyh@alibaba-inc.com
Haibo Zhang
Alibaba DAMO Academy
zhanhui.zhb@alibaba-inc.com
Xiaoyu Lv
Alibaba DAMO Academy
anzhi.lxy@alibaba-inc.com
Liang Yao
Alibaba DAMO Academy
yaoliang.yl@alibaba-inc.com
Jun Xie
Alibaba DAMO Academy
qingjing.xj@alibaba-inc.com
Query language identification (Q-LID) plays a crucial role in a cross-lingual search engine.
There exist two main challenges in Q-LID: (1) insufficient contextual information in queries for
disambiguation; and (2) the lack of query-style training examples for low-resource languages.
In this paper, we propose a neural Q-LID model by alleviating the above problems from both
model architecture and data augmentation perspectives. Concretely, we build our model upon the
∗Corresponding author.
Action Editor: Mohit Bansal. Submission received: 13 October 2021; revised version received: 19 June 2022; accepted for publication: 28 June 2022.
https://doi.org/10.1162/coli_a_00451
© 2022 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license
advanced TRANSFORMER model. In order to enhance the discrimination of queries, a variety of external features (e.g., character, word, as well as script) are fed into the model and fused by a multi-scale attention mechanism. Moreover, to remedy the low-resource challenge in this task, a novel machine translation–based strategy is proposed to automatically generate synthetic query-style data for low-resource languages. We contribute the first Q-LID test set called QID-21, which consists of search queries in 21 languages. Experimental results reveal that our model yields better classification accuracy than strong baselines and existing LID systems on both query and traditional LID tasks.1
1. Introduction
Cross-lingual information retrieval (CLIR) can have separate query language identification (Q-LID), query translation, information retrieval, as well as machine-learned ranking stages (Sabet et al. 2019; Sun, Sia, and Duh 2020; Li et al. 2020). Among them, the Q-LID stage takes a multilingual user query as input and returns the language classification results for the downstream translation and retrieval tasks. Low-quality Q-LID may cause problems such as inaccurate and missed translations, eventually resulting in irrelevant recalls or null results that are inconsistent with the user's intention (Bosca and Dini 2010; Lui, Lau, and Baldwin 2014; Tambi, Kale, and King 2020).
Recently, deep neural networks have shown their superiority and even yielded human-level performance in a variety of natural language processing tasks, for example, text classification (Kim 2014; Mandal and Singh 2018), language modeling (Devlin et al. 2019; Conneau and Lample 2019), as well as machine translation (Vaswani et al. 2017; Dai et al. 2019). However, most existing Q-LID systems still apply traditional models, e.g., Random Forest (Vo and Khoury 2019), Gradient Boost Tree (Tambi, Kale, and King 2020), and statistical-based approaches (Duvenhage 2019), which depend on massive feature engineering (Mandal and Singh 2018). Generally, the inapplicability of neural networks in the Q-LID task mainly lies in two concerns:
• C1: Queries are usually composed of keywords and are presented as short texts. The lack of contextual information in queries raises the difficulty of Q-LID, especially for fuzzy searches in real-world scenarios such as misspelling and code-switching (Tambi, Kale, and King 2020; Ren et al. 2022; Wan et al. 2022). End-to-end training of neural-based models regardless of prior knowledge may be insufficient to cope with this task.

• C2: A well-performing neural model depends on extensive training examples (Devlin et al. 2019). In contrast with conventional LID models that can exploit massive collections of public data such as the W2C corpus (Majlis and Zabokrtský 2012) and the Common Crawl corpus (Schäfer 2016), well-labeled query data covering low-resource languages are unavailable. The unbalanced training corpus potentially causes learning biases and weakens model performance (Glorot, Bordes, and Bengio 2011).
1 The source code and the associated benchmark have been released at: https://github.com/xzhren/Q-LID.
Considering that short text queries lack sufficient context, a conventional character-feature based representation model has difficulty in obtaining effective classification information. Because there are abundant high-frequency characters that often appear in various words or even multiple languages, the amount of information carried by each character feature is not large enough to distinguish which language or even which word it belongs to. Therefore, one can consider introducing higher-order features like word features to disambiguate the meaning of character features. In addition, the Unicode encoding block information of each character is an effective way to increase the amount of information, also known as the script feature.2 Thus, word and script features can help the model better understand the contextual meaning of short text queries.
In this paper, we aim at alleviating the problems listed above and building a neural-based Q-LID system. In order to enhance the discrimination of queries and the robustness in handling fuzzy inputs (C1), we introduce multi-feature embedding, in which character, word, as well as script serve as distinct embeddings and are integrated into the input representations of our model. In addition, a multi-scale attention mechanism (Beltagy, Peters, and Cohan 2020; Xu et al. 2022) is applied to force the encoder to extract and fuse different information. Finally, in response to the problem of unbalanced training samples (C2), we propose a novel data augmentation method that generates pseudo multilingual data by translating examples from a resource-rich language (e.g., English) to low-resource ones using machine translation.
In order to evaluate the effectiveness of the proposed model, we collect a benchmark in 21 languages called QID-21; each language contains 1,000 manually labeled queries extracted from a real-world search engine—AliExpress—which is an online international retail service.3 Experimental results demonstrate that our Q-LID system yields better accuracy than strong neural-based text classification baselines and several existing LID systems. Interestingly, our model consistently yields improvement on an existing short-text (out-of-domain) LID task, indicating its universal effectiveness. Qualitative analyses reveal that the new approach can exactly handle situations of fuzzy inputs. To summarize, the major contributions of our work are three-fold:
• We introduce multi-feature learning to improve a neural Q-LID model on classifying ambiguous queries, which can also be effective in other NLP tasks that handle short texts.

• We propose a novel translation–based data augmentation approach to balance the training samples between low- and rich-resource languages.

• We collect QID-21 and make it publicly available, which may contribute to subsequent research in the language identification community.
2. Related Work
2.1 Query Language Identification
Over the past decade, most researchers have explored LID models for document or sentence classification (Jauhiainen et al. 2019; Deshwal, Sangwan, and Kumar 2019; Qi, Ma, and Gu 2019), while few studies have paid attention to search queries.
2 https://en.wikipedia.org/wiki/Unicode_block.
3 https://www.aliexpress.com/.
Typically, queries are short and noisy, including an abundance of spelling mistakes, code-switching, and non-word tokens such as URLs, emoticons, and hashtags. Prior studies have shown that out-of-the-box and state-of-the-art LID systems suffer significant drops in accuracy when applied to queries (Lui and Baldwin 2012; Tambi, Kale, and King 2020). An interesting research direction is token-level LID for code-mixed texts (Zhang et al. 2018; Mager, Cetinoglu, and Kann 2019; Mandal and Singh 2018). However, fine-grained LID provides only marginal assistance for the CLIR task, since the downstream modules (e.g., machine translation and information retrieval) depend on a unique language label rather than the multiple identifications of all tokens in the query. Moreover, token-level LID may introduce more error information that propagates to downstream tasks.
Our work falls into the short-text sentence-level LID context. In this community, Duvenhage (2019) studies the low-resource task and presents a hierarchical naive Bayesian and lexicon-based classifier. Godinez et al. (2020) investigate several linguistic features and prove that prior knowledge is able to alleviate the problem of insufficient contextual information in short-text LID. Tambi, Kale, and King (2020) build a Q-LID model based on Gradient Boost Tree by collecting noisy and weakly labeled training data. All of these studies are based on traditional models (e.g., Random Forest, naive Bayesian, Support Vector Machines).
Considering the neural-based approaches, Vo and Khoury (2019) exploit convolutional neural networks and prove their effectiveness on the short-text LID task. Nevertheless, their model was designed for classifying short messages on Twitter, which has extensive in-domain training data and relatively longer sequences than queries. Contrary to Vo and Khoury (2019), the Q-LID task has higher expectations of disambiguation and data quality. To this end, we investigate several effective modules such as multi-feature embedding and a multi-scale attention mechanism. A novel machine translation–based data augmentation is also introduced to ease the deficiency of in-domain training samples.
2.2 Feature Engineering
Feature engineering transforms the feature space of a dataset to improve modeling performance. For NLP tasks, Deng et al. (2019) investigate text feature representation methods based on the bag-of-words model, and propose four methods of filter, wrapper, embedded, and hybrid for feature selection. Garla and Brandt (2012) utilize domain knowledge for feature extraction and ranking when performing clinical text classification. Textual features such as bag-of-words, hotspots, and semantic kernels are explored.

As a classic text classification task, LID normally involves feature engineering. With a traditional model, Wu et al. (2019) use both character and word n-gram features. The character n-grams varied between 1 and 9 and the word n-grams varied from 1 to 3. The features were weighted with either a tf-idf or BM25 weighting scheme. With a neural-based model, Zhang et al. (2018) propose CMX using character n-gram, script, and lexicon features. Among them, the lexicon feature group is backed by a large lexicon table, which holds a language distribution for each token observed in the monolingual training data. In contrast, the multi-feature embedding proposed in this work includes character, script, word, and positional features to preserve the sequential nature of the text, which is more conducive to the acquisition of contextual information.
2.3 Data Augmentation
Bayer, Kaufhold, and Reuter (2021) provide an overview of data augmentation approaches suited for the textual domain. Among them, translation is generalized as a document-level data augmentation method in data space. However, it generally refers to the round-trip translation strategy (Wan et al. 2020; Yao et al. 2020). By translating a document into another language and afterward translating it back into the source language, the round-trip translation strategy leads to various possibilities in the choice of terms or sentence structure. In addition, one-way translation can also be regarded as a generative method of data augmentation in multilingual scenarios. Amjad, Sidorov, and Zhila (2020) use machine translation to migrate a large-scale supervised corpus existing in English to the low-resource language Urdu, accordingly solving the problem of the lack of annotated fake news detection data in Urdu. Bornea et al. (2021) utilize translation as data augmentation to improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.

Regarding the task of LID, Ceolin (2021) carries out data augmentation experiments with Random Swap (swapping the position of two words in the sentence), Random Deletion (removing one word from the sentence), Random Insertion (inserting one extra word into the sentence), and Random Replacement (replacing a word with a synonym). This is an effective data enhancement strategy for LID (Wei and Zou 2019). However, it still cannot eliminate the problems of data imbalance and domain mismatch. We propose a novel machine translation–based data augmentation method that can fill this role well.
3. Model Architecture
Our model is built upon the Transformer (Vaswani et al. 2017) architecture, which is highly parallelized and has shown excellent capability on natural language processing (Devlin et al. 2019). Contrary to a common setting that exploits multiple layers, we merely apply one layer with several enhancements for the sake of computational efficiency. The model architecture is illustrated in Figure 1. Given an input query X, we first adopt a multi-feature embedding module to integrate multiple embeddings into its representation vector X. Then, an encoding layer that consists of two sub-layers is exploited to capture the query features. The first sub-layer is a multi-head multi-scale attention layer (Beltagy, Peters, and Cohan 2020), notated as MHMSA(·), and the second one is a position-wise fully connected feed-forward network with ReLU activation, denoted as FFN(·). A residual connection (He et al. 2016) is used around each of the two sub-layers, followed by layer normalization (Ba, Kiros, and Hinton 2016). Formally, the output of the first sub-layer H and the second sub-layer Z are calculated as:
H = LN(X + MHMSA(X))    (1)
Z = LN(H + FFN(H))    (2)

where LN(·) means the layer normalization.
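For illustration, the one-layer encoder of Equations (1)–(2) can be sketched as follows in PyTorch. This is a minimal rendition rather than the released implementation: the MHMSA sub-layer of Section 3.2 is passed in as a callable, and dropout is placed as in the standard Transformer.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: H = LN(X + MHMSA(X)); Z = LN(H + FFN(H))."""
    def __init__(self, mhmsa, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mhmsa = mhmsa                      # multi-head multi-scale attention (Section 3.2)
        self.ffn = nn.Sequential(               # position-wise feed-forward with ReLU
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                            # x: [batch, length, d_model]
        h = self.ln1(x + self.drop(self.mhmsa(x)))   # Equation (1)
        z = self.ln2(h + self.drop(self.ffn(h)))     # Equation (2)
        return z
```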
Figure 1
Illustration of the architecture of our model with the input of "Clé USB." Our model is built upon a 1-layer TRANSFORMER architecture. In order to tackle the problem of insufficient context in a query, we exploit multi-feature embedding to characterize the input query with character, word, script,4 and positional features. We further adopt a multi-head multi-scale attention module to capture different information using distinct window masks. Finally, the probability distribution is predicted through the output layer.
The token representations Z are averaged to obtain the query embedding Z.5 The final prediction is the label Y with the highest probability, calculated as:

Y = argmax(softmax(WoZ + bo))    (3)
where Wo ∈ RD×L, bo ∈ RL are trainable parameters with D being the hidden size and
L being the number of languages. softmax(·) represents a nonlinear function which is
used to normalize the probability distribution of labels.
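The pooling and output layer of Equation (3) can be sketched as below; following footnote 5, special symbols (begin, end, padding, segmentation) are excluded from the mean via a validity mask. Shapes and names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Masked mean pooling over valid characters, then a softmax classifier (Equation 3)."""
    def __init__(self, d_model=512, num_languages=21):
        super().__init__()
        self.proj = nn.Linear(d_model, num_languages)   # Wo, bo

    def forward(self, z, valid_mask):
        # z: [batch, length, d_model]; valid_mask: [batch, length], 1 for real characters,
        # 0 for begin/end/padding/segmentation symbols (footnote 5).
        mask = valid_mask.unsqueeze(-1).float()
        pooled = (z * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        logits = self.proj(pooled)
        probs = torch.softmax(logits, dim=-1)
        return probs.argmax(dim=-1), probs               # predicted label Y and the distribution
```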
3.1 Multi-Feature Embedding
Most existing LID approaches exploit character embedding (Jauhiainen et al. 2019) to avoid the problem of out-of-vocabulary (OOV) tokens. However, the frequencies of different characters are extremely discrepant. For example, Chinese characters are sparse and rarely appear at model learning time, making the model underfit on Chinese. On the other hand, high-frequency characters, such as a to z, are shared by many languages and difficult to distinguish. The problem becomes serious when a query is composed of few characters and lacks contextual information.
4 In this case, [BL] and [LS] are assigned as the "Basic Latin" and "Latin Supplement" Unicode blocks.
5 Note that we merely reduce vectors with respect to all valid characters, that is, the vectors of the symbols that represent begin, end, padding, as well as segmentation are masked in the mean operation.
Consequently, it is necessary to incorporate more features to help the model identify the languages of queries. We propose multi-feature embedding, which leverages character, script, and word features to distinguish queries.
• Character Embedding (Ec): The character-level features are selected as the basic embedding. We assign [B] as the blank token, [E] as the special token indicating the end of the sentence, and [U] as the token for unknown characters (OOV). Following the common setting, we prune extremely rare characters to reduce the character vocabulary size.6

• Script Embedding (Es): Because several low-frequency characters rarely appear in the training set, they may fail to be well learned during training. Therefore, we extend our model with the script feature, which can strongly bind certain characters to a specific language. For example, Hiragana and Hangul are only used in Japanese and Korean, respectively. Unicode blocks provide explicit guidance, each of which is generally meant to supply glyphs used by one or more specific languages.7 To this end, we use the Unicode block serial number as the script feature of a character (a minimal lookup sketch follows this list).

• Word Embedding (Ew): A natural concern for exploiting word embeddings is the large vocabulary size. Moreover, the distribution of words is unbalanced across languages. For example, there are 100K words commonly used in Chinese, whereas only 20K frequent terms in English. In response to these problems, we propose two strategies to reduce the vocabulary size: (1) pruning words in a language whose script feature is highly recognizable, such as Thai; and (2) splitting words into sub-word units using the word piece model following Wu et al. (2016) and Devlin et al. (2019). In this way, queries with language-shared characters can be discriminated with the complement of word features.

• Positional Embedding (Ep): For sequential information modeling, we further add the sinusoidal positional encoding to the input embedding following Vaswani et al. (2017).
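To make the script feature concrete, the following sketch maps a character to the serial number of its Unicode block through a range lookup. Only a handful of blocks are listed for illustration; the full table used by the model covers the complete block list (about 107 entries, matching the script vocabulary size reported in Section 5.2).

```python
import bisect

# Illustrative subset of Unicode blocks: (start, end, serial number used as the script id).
UNICODE_BLOCKS = [
    (0x0000, 0x007F, 0),   # Basic Latin             -> e.g. "U", "S", "B"
    (0x0080, 0x00FF, 1),   # Latin-1 Supplement      -> e.g. "é" in "Clé"
    (0x0400, 0x04FF, 2),   # Cyrillic                -> Russian queries
    (0x0E00, 0x0E7F, 3),   # Thai
    (0x3040, 0x309F, 4),   # Hiragana                -> Japanese only
    (0xAC00, 0xD7AF, 5),   # Hangul Syllables        -> Korean only
]
_STARTS = [b[0] for b in UNICODE_BLOCKS]

def script_id(char, unk=len(UNICODE_BLOCKS)):
    """Return the serial number of the Unicode block containing `char`."""
    cp = ord(char)
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and UNICODE_BLOCKS[i][0] <= cp <= UNICODE_BLOCKS[i][1]:
        return UNICODE_BLOCKS[i][2]
    return unk   # characters outside the listed blocks

print([script_id(c) for c in "Clé"])   # [0, 0, 1]: Basic Latin, Basic Latin, Latin-1 Supplement
```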
Overall, we follow the common setting in Transformer to sum up the input embeddings. All the embeddings have the same dimensionality and are co-optimized with the model. The final multi-feature embedding X of the input query is the position-wise sum of the above embeddings:
X = Ec_X + Ew_X + Es_X + Ep_X    (4)
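Equation (4) amounts to four embedding lookups summed position-wise. A minimal sketch is given below; dimensionalities follow Section 5.2, the word id is assumed to be repeated over the characters it spans (as in Figure 1), and all names are illustrative.

```python
import torch
import torch.nn as nn

class MultiFeatureEmbedding(nn.Module):
    """X = Ec_X + Ew_X + Es_X + Ep_X (Equation 4): character, word, script,
    and sinusoidal positional embeddings share one dimensionality and are summed."""
    def __init__(self, d_model=512, n_chars=13500, n_words=58400, n_scripts=107, max_len=512):
        super().__init__()
        self.char = nn.Embedding(n_chars, d_model)
        self.word = nn.Embedding(n_words, d_model)      # word/sub-word id repeated per character
        self.script = nn.Embedding(n_scripts, d_model)  # Unicode block serial number per character
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                  # sinusoidal positional encoding

    def forward(self, char_ids, word_ids, script_ids):  # each: [batch, length]
        length = char_ids.size(1)
        return (self.char(char_ids) + self.word(word_ids)
                + self.script(script_ids) + self.pe[:length])
```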
3.2 Multi-Head Multi-Scale Attention
Another problem is how to exploit multi-feature embeddings in the final classifica-
tion task. As the core component in TRANSFORMER (Vaswani et al. 2017), multi-head
6 Following the common setting, we prune those characters whose frequencies in the training set are less than 10.
7 https://en.wikipedia.org/wiki/Unicode_block.
self-attention performs multiple self-attention modules on input representations, thus jointly attending to information from different representation subspaces at different positions. However, several studies have pointed out that the overall view of the input sentence may lead self-attention to overlook fine-grained information (Yang et al. 2019; Guo, Zhang, and Liu 2019). Li et al. (2018) and Strubell et al. (2018) suggest that guiding different heads to learn distinct features can generate more informative representations. To this end, we adopt multi-head multi-scale attention (Yang et al. 2019; Beltagy, Peters, and Cohan 2020; Xu et al. 2019), which assigns different attention window sizes to heads, making them inspect the input query from different perspectives. In contrast with prior studies that use this strategy to extract information of distinct granularity in documents or long sentences, we are the first to apply it to the short-text scenario. We expect that these task-specific heads can jointly generate a more informative representation for the corresponding query. Specifically, our model first transforms the input representation X into the h-th subspace with different linear projections:
Qh, Kh, Vh = XWh_Q, XWh_K, XWh_V    (5)

where {Wh_Q, Wh_K, Wh_V} ∈ RD×Dh denote learnable parameter matrices associated with the h-th head, and Dh represents the dimensionality of the h-th head subspace. N attention functions are applied to generate the output states in parallel:
headh = softmax(QhKhT / √Dh + MSM(wh)) Vh    (6)
Here, √Dh is the scaling factor. We achieve the multi-scale mechanism by locating a mask matrix MSM(wh) for the h-th attention head, thus forcing it to extract features in a specific window wh. The final output of a multi-head multi-scale attention layer is the concatenation of all heads:

MHMSA(X) = [head1, ··· , headN]    (7)
In this way, character, word, phrase, as well as sentence features are distinctly extracted by different heads and finally fused.
Multi-Scale Mask. The h-th mask matrix can be formally expressed as:

MSM(wh) = Mh ∈ RI×I    (8)
where I denotes the sequence length, and the item Mh_i,j in the matrix is allocated with 0 or −∞ to measure whether the i-th element is able to attend to the j-th element, that is:

Mh_i,j = 0, if (i − wh) ≤ j ≤ (i + wh); −∞, if j < (i − wh) or j > (i + wh)    (9)
The example of multi-scale masks is shown in Figure 1. In this paper, we explore the setting of the window size wh of each head using Grid Search (Bergstra et al. 2011). Based on empirical results, we finally set wh of four heads to 0, 1, 2, 3, respectively. The window sizes of the remaining 4 heads are set to the length of the sequence, thus capturing global information.
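A minimal sketch of the mask of Equations (8)–(9) and of one windowed attention head (Equation 6) is given below; shapes and the treatment of global heads are assumptions for illustration, not the released code.

```python
import torch

def multi_scale_mask(length, window):
    """MSM(wh): 0 inside the band |i - j| <= window, -inf outside (Equation 9).
    A window of None (or >= length) leaves the head fully global."""
    if window is None or window >= length:
        return torch.zeros(length, length)
    idx = torch.arange(length)
    outside = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs() > window
    return torch.where(outside, torch.full((length, length), float("-inf")),
                       torch.zeros(length, length))

def masked_head(q, k, v, window):
    """One attention head restricted to its window (Equation 6)."""
    d_h = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_h ** 0.5
    scores = scores + multi_scale_mask(q.size(-2), window)
    return torch.softmax(scores, dim=-1) @ v

# Window sizes used in this paper: 0, 1, 2, 3 for four heads, global for the other four.
windows = [0, 1, 2, 3, None, None, None, None]
```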
Figure 2
Illustration of synthetic data generation using machine translation systems. Given a query that is
in a resource-rich language, we translate it to other languages to construct the pseudo dataset.
4. Data Augmentation
A well-performing neural-based NLP model depends on extensive language resources (Devlin et al. 2019). The existing well-labeled LID training sets usually consist of long sentences or documents, while few are in short-text style. Moreover, the numbers of existing LID training samples are unbalanced across languages. For example, there are extensive English queries or keywords collected from the Web, whereas it is difficult to find examples in relatively low-resource languages such as Indonesian or Hindi. Both of these cause training bias: The model overfits on long texts and tends to predict the labels of resource-rich languages (Glorot, Bordes, and Bengio 2011). Hence, an in-domain and balanced dataset is essential to the Q-LID task. We approach this problem by proposing a machine translation–based method to construct synthetic data.
Machine Translation–Based Data Augmentation. The starting point of our approach is an observation about language resources. Considering resource-rich languages such as English, it is easy to obtain large-scale monolingual Q-LID training samples. In the meantime, there exist relatively more English-to-multilingual parallel corpora to build well-performing machine translation systems. Naturally, leveraging a machine translation system to generate large-scale pseudo data would be an appealing alternative to alleviate the lack of Q-LID training samples. Concretely, we first build multiple machine translation systems that take English as the source side. Then, a large amount of English samples in the search domain are translated into the target languages, as shown in Figure 2. In this way, we can obtain extensive and balanced in-domain synthetic data for model training.
Note that noise caused by a machine translation model may harm the quality of the pseudo data. We propose to filter translations that are identical to their source texts (i.e., untranslated examples). Since translation errors associated with semantics marginally affect the LID task, we keep such samples in the dataset for model robustness.
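The augmentation pipeline can be sketched as follows: English in-domain queries are translated into each target language and untranslated outputs are filtered out, as described above. The translate function is a stand-in for the Tatoeba-based English-to-X systems and is assumed here, not part of any released API.

```python
def build_pseudo_corpus(english_queries, target_langs, translate):
    """Generate synthetic Q-LID training pairs (query, language) by machine translation.

    `translate(text, tgt_lang)` is a placeholder for an English-to-tgt_lang MT system.
    Translations identical to the source are treated as untranslated and dropped;
    other translation noise is kept for robustness, as discussed above.
    """
    pseudo = []
    for lang in target_langs:
        for query in english_queries:
            hyp = translate(query, lang)
            if hyp.strip().lower() == query.strip().lower():
                continue                      # untranslated example, filter it out
            pseudo.append((hyp, lang))
    return pseudo

# Toy usage with a stub translator (illustrative only):
if __name__ == "__main__":
    stub = {("mask", "id"): "masker", ("mask", "hi"): "नकाब"}
    translate = lambda q, l: stub.get((q, l), q)
    print(build_pseudo_corpus(["mask"], ["id", "hi"], translate))
```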
5. Experiments
We examine the effectiveness of the proposed method on a collected Q-LID dataset and
an open-source LID dataset.
5.1 Datasets

We construct our multilingual data for 21 languages, including: English (en), Chinese (zh), Russian (ru), Portuguese (pt), Spanish (es), French (fr), German (de), Italian (it), Dutch (nl), Japanese (ja), Korean (ko), Arabic (ar), Thai (th), Hindi (hi), Hebrew (he), Vietnamese (vi), Turkish (tr), Polish (pl), Indonesian (id), Malay (ms), and Ukrainian (uk).
(1) Training Set
We extract a large amount of monolingual data through the collection and crawling of open data on the Internet, and obtain publicly available parallel corpora for the training of machine translation models. Regarding low-resource languages, we construct synthetic pseudo data with English search data and machine translation models. Finally, we build a training set covering 21 languages, each of which consists of 4 million (M) samples. Details of our datasets are as follows:
• Multilingual Out-of-Domain Data are selected from the released datasets: the W2C corpus (Majlis and Zabokrtský 2012), the Common Crawl corpus (Schäfer 2016), and Tatoeba (Tiedemann and Thottingal 2020).

• Parallel Corpus data are extracted from the open-source Tatoeba dataset (Tiedemann and Thottingal 2020).

• Synthetic In-Domain Data are composed of in-domain queries or keywords, which are generated by the data augmentation method described in Section 4. We build English-to-multilingual machine translation models following the open source project Tatoeba8 (Tiedemann and Thottingal 2020). These models are trained using the parallel corpus introduced above. The in-domain high-quality English queries are collected from the search logs of a search engine—the AliExpress search service.
Overall, for the multilingual out-of-domain data, there are 2M samples screened for each language. Considering the synthetic in-domain data, we finally collect 2M pseudo samples for each language. Eventually, the number of training samples for each language is about 4M, half of which are out-of-domain while the remainder are in-domain.
(2) Evaluation Datasets
We collect a QID-21 set that contains multilingual queries and language labels manually checked by native experts of each corresponding language. All the queries are extracted from the in-domain training set with careful data desensitization. In order to investigate the universal effectiveness of the proposed methods, we further extract a short-text set KB-21 from Kocmi and Bojar (2017), using a subset of 21 languages. Considering the QID-21 set, there are 21,440 sentences, the average word count per sample is 2.56, and the average character count is 15.53. Regarding the KB-21 set, there are 2,100 sentences, and the average numbers of words and characters per sample are 4.47 and 34.90, respectively.
8 https://github.com/Helsinki-NLP/Tatoeba-Challenge.
Table 1
Statistics of our training and test sets. It can be seen that out-of-domain data consist of generally long sentences, which is a challenge for short-text LID in query scenarios. The synthetic in-domain data acquired through data augmentation can fill the domain gap of the dataset.

Dataset | Sentences | Tokens per sentence | Characters per sentence
Train: Out-of-Domain | 42M | 13.05 | 72.27
Train: In-Domain | 42M | 2.92 | 18.32
Test: QID-21 | 21,440 | 2.56 | 15.53
Test: KB-21 | 2,100 | 4.47 | 34.90
The data statistics of the training set and test set are shown in Table 1.
(3) Data Release
We release all the evaluation datasets, including the KB-21 set and the QID-21 set. For the training set, we release the multilingual out-of-domain data and the parallel corpus as well. Particularly, the QID-21 dataset with 21,440 queries (in 21 languages) is desensitized and reviewed by several linguistic experts; it is the first benchmark for query language identification and may contribute to subsequent research in the language identification community.

Nevertheless, the synthetic in-domain data cannot be released, since the source English queries are collected from the search logs of the AliExpress search service and thus contain sensitive user and business information, and it is infeasible to manually filter and check all the samples.9
5.2 Experimental Setting
We follow the base model setting of Vaswani et al. (2017), except that the number of layers is set to 1. Thus, the hidden size is 512, the filter size is 2,048, the dropout rate is 0.1, and the head number is 8. Considering the proposed multi-head multi-scale attention (MHMSA), we set the window sizes (wh) of 4 heads to 0, 1, 2, 3, respectively. The window sizes of the remaining 4 heads are set to the sequence length, thus capturing global information. The character, word, and script vocabulary sizes are 13.5K, 58.4K, and 107, respectively. For training, we use the Adam optimizer with the same learning rate schedule strategy as Vaswani et al. (2017) and 8k warmup steps. Each batch consists of 1,024 examples and the dropout rate is set to a constant of 0.1. Models are trained on a single Tesla P100 GPU.
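For reference, the learning rate schedule of Vaswani et al. (2017) with the 8k warmup steps used here can be written as below; this reproduces the standard formula rather than the authors' training code.

```python
def transformer_lr(step, d_model=512, warmup_steps=8000):
    """Inverse-square-root schedule with linear warmup (Vaswani et al. 2017):
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate peaks at the end of warmup and decays afterwards.
print(transformer_lr(8000))   # ~4.9e-4 at the peak
print(transformer_lr(80000))  # ~1.6e-4 later in training
```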
In this study, a 1-layer TRANSFORMER model serves as the baseline. We reimplement several existing neural-based LID approaches and widely used text classification models, and compare with popular LID systems, as listed in Table 2.

Text Classification Models. For FASTTEXT, we exploit 1-3 grams to extract characters and words. For TEXTCNN, we apply six filters with sizes of 3, 3, 4, 4, 5, 5 and a hidden size of 512. For computational efficiency, 1-layer networks are used by default if no confusion is possible. For TRANSFORMER, we also use the higher-performance configurations of 6-layer and 12-layer networks. Moreover, we fine-tune the M-BERT and XLM-R models based on large-scale corpus pre-training. The settings of these big models are the same as in the original papers, with 12 layers, 768 hidden states, 3,072 filter states, and 12 heads.
9 For the purpose of reproducing our results, we release our final models (trained with augmented data) at https://github.com/xzhren/Q-LID.
Table 2
Classification accuracy (ACC) on test sets. We report the average score of 5 independent experimental runs for each neural-based model. + indicates that the training set is enhanced with the proposed data augmentation. "Speed" denotes the number of characters processed per second with the batch size being 1. As seen, our final model outperforms the Transformer baseline by over 11 ACC (95.35 vs. 84.26) on the QID task. "†" and "††" indicate the improvement over TRANSFORMER is statistically significant (p < 0.05 and p < 0.01, respectively), estimated by bootstrap sampling (Koehn 2004).

Model | QID-21 | QID-21+ | KB-21 | KB-21+ | Parameter | Speed

Existing LID Systems
Langid.py (Lui and Baldwin 2012) | 73.76 | – | 91.33 | – | 0.8M | 18.4k
LanideNN (Kocmi and Bojar 2017) | 67.77 | – | 92.71 | – | 3.3M | 0.03k
Bing Online | 83.87 | – | 93.95 | – | – | –
Google Online | 89.08 | – | 96.19 | – | – | –

Reimplemented LID Models
LOGISTIC REGRESSION (LR) (Bestgen 2021) | 83.01 | 72.51 | 89.88 | 90.92 | – | 41.5k
NAIVE BAYES (NB) (Jauhiainen, Jauhiainen, and Lindén 2021) | 84.23 | 82.16 | 89.91 | 91.42 | – | 23.4k
ATTENTIONCNN (Vo and Khoury 2019) | 72.62 | 91.41 | 91.33 | 93.38 | 15.2M | 11.2k

Reimplemented Text Classification Models
FASTTEXT (Joulin et al. 2017) | 70.95 | 82.52 | 88.69 | 90.46 | 24.3M | 65.8k
TEXTCNN (Kim 2014) | 81.57 | 91.21 | 91.24 | 93.19 | 15.0M | 11.8k
TRANSFORMER (6 Layer) (Vaswani et al. 2017) | 85.74 | 92.80 | 93.14 | 94.67 | 32.5M | 2.7k
TRANSFORMER (12 Layer) (Vaswani et al. 2017) | 85.93 | 92.82 | 93.38 | 94.71 | 51.3M | 1.6k
M-BERT (12 Layer) (Devlin et al. 2019) | 86.37 | 92.53 | 93.95 | 95.95 | 177.9M | 1.5k
XLM-R (12 Layer) (Conneau et al. 2020) | 86.51 | 92.97 | 94.04 | 95.98 | 279.2M | 1.1k

Our Q-LID Systems
TRANSFORMER | 84.26 | 91.40 | 92.81 | 93.48 | 16.8M | 12.3k
Our Model | 89.77†† | 95.35†† | 94.29† | 96.86†† | 46.8M | 11.6k
Popular LID Approaches. We reproduce two state-of-the-art models from the VarDial-21 LID task (Chakravarthi et al. 2021), based on naive Bayes (Jauhiainen, Jauhiainen, and Lindén 2021) and Logistic Regression (Bestgen 2021), respectively. In addition, ATTENTIONCNN (Vo and Khoury 2019), devoted to the short-text LID task, is reimplemented. Other configurations of our reimplementations follow the common settings described in the corresponding literature or the released source code.
Existing LID Systems. Moreover, we also examine 4 popular LID systems on our
LID tasks, including Langid.py10 (Lui and Baldwin 2012), LanideNN11 (Kocmi and Bojar
2017), Google LID,12 as well as Bing LID.13
5.3 Experimental Results
(1) Main Results
As shown in Table 2, our model outperforms existing LID systems and related classification models. Specifically, applying data augmentation consistently improves the accuracy by 6%–13% across model architectures. It is interesting to see that augmented data helps more on QID-21 than on KB-21. The main reason stems from the fact that the augmented samples are translated from search queries, which share the domain of the QID-21 set but are inconsistent with the short texts in KB-21.
10 https://github.com/saffsd/langid.py.
11 https://github.com/kocmitom/LanideNN.
12 https://translate.google.com.
13 https://www.bing.com/translator.
Table 3
Ablation study of the proposed components. They improve the identification accuracy, and are
complementary to each other.
Model
TRANSFORMER
w/ Word Feature
w/ Script Feature
w/ MHMSA
Our Model
QID-21 KB-21
93.48
94.19
94.00
93.62
96.86
91.40
93.50
92.75
92.08
95.35
Param.
Speed
16.8M 12.3k
46.7M 11.7k
16.9M 11.8k
16.8M 12.2k
46.8M 11.6k
Considering the model architecture, FASTTEXT yields the fastest processing speed but the lowest classification accuracy. Compared to the CNN-based approaches (TEXTCNN, ATTENTIONCNN), TRANSFORMER possesses comparable speed but better quality on LID, reconfirming the strength of the baseline system on language modeling. The shallow models (LOGISTIC REGRESSION, NAIVE BAYES) achieve faster inference speed, but yield poor accuracy. It is worth noting that these approaches are the state-of-the-art in the VarDial-21 LID task.14 In the VarDial task, the neural-based approaches underperform shallow ones, since the main challenge of VarDial lies in low-resource and dialect-style texts. On the contrary, texts in our task are short and noisy. By incorporating multi-feature embedding and multi-scale attention, our model surpasses the strong baselines. It is encouraging to see that the proposed approach even gains higher accuracy and is 12 times faster than several complicated networks, for example, TRANSFORMER (12 Layer) and M-BERT (12 Layer). In particular, the latter is initialized by a language model that was pre-trained with billions of multilingual samples.15 Finally, data augmentation and the enhancements on model architecture are complementary to each other, and their combination increases accuracy by over 11 points on the query LID task.
(2) Ablation Study on Model Enhancements
We conduct experiments to evaluate the effectiveness of the proposed multi-feature embedding and MHMSA. As concluded in Table 3, the word feature, the script feature, as well as MHMSA progressively improve the model performance. Also, the proposed model shows superiority on both the in-domain and out-of-domain LID tasks, verifying its universal effectiveness.
(3) Ablation Study on Data Augmentation
A question is whether the improvements from data augmentation derive from the in-domain samples or from the larger data scale. To answer this question, we conduct an experiment where we complement the training data using the same number of training examples from the out-of-domain dataset instead of pseudo ones. The results listed in Table 4 demonstrate that the additional training examples marginally affect the quality of Q-LID. The synthetic data provide shorter and more domain-specific training samples than real data, which contributes to the short-text LID. Furthermore, our experiments also show that there is no further improvement from directly training our Q-LID model with the parallel data used for teaching the machine translation systems.
14 https://sites.google.com/view/vardial2021/.
15 Because M-BERT does not have character embeddings, we only use word features in this experiment.
Table 4
Ablation study on data augmentation. Neither the additional out-of-domain samples nor the
large-scale parallel corpus used for machine translation training directly contribute to LID.
Training Set
QID-21
KB-21
Out-of-Domain
w/ Synthetic In-Domain
w/ Parallel
w/ Out-of-Domain (Addition)
w/ Synthetic In-Domain (20%)
w/ Synthetic In-Domain (50%)
w/ Synthetic In-Domain (80%)
89.77
95.35
90.93
90.89
92.02
94.89
95.30
94.29
96.86
94.38
94.52
94.91
96.12
96.79
In addition, we explore the influence of different amounts of synthetic in-domain data. We carry out experiments with 20%, 50%, and 80% of the synthetic data, as shown in Table 4. It can be observed that when the synthetic data reaches 50%, the improvement is the largest, and when it reaches 80%, there are still some slight improvements. This further demonstrates the effectiveness of our augmented data.
6. Analysis
6.1 Quantitative Analysis
(1) Impact of Multi-Feature Embedding
We further investigate the impact of multi-features. As shown in Table 5, the dis-
tribution of characters in the vanilla model is compact. For example, the top 100 most
frequent characters cover 81.93% of occurrences over the training set. The proposed
multi-feature embedding significantly alleviates this problem. Figure 3 gives the dis-
tribution of vocabulary from the perspective of languages. Compared with other lan-
guages, Chinese (zh), Japanese (ja), as well as Korean (ko) have the most yet relatively
sparse characters appearing in the training corpus. The proposed method leverages
different features, making the counts of input multi-feature embeddings more balanced to some extent. This is beneficial to Q-LID since the model is trained in a more stable fashion.
(2) Impact of Multi-Head Multi-Scale Attention
We conduct an experiment to explore the effectiveness of the MHMSA mechanism. As shown in Figure 4, our method gains fewer identification errors on short sequences, verifying our hypothesis that a local window in the attention head is beneficial to the performance of Q-LID.
Table 5
The ratio of top frequent input embeddings to the total occurrences of input embeddings in
training data. Obviously, multiple features significantly alleviate the problem of sparse data.
Proportion (%)
TOP 100
TOP 1K TOP 2K
Character Only
w/ Multi-Feature
81.93
53.54
98.60
75.30
99.64
79.99
Figure 3
The frequency distribution of input embeddings in 21 languages (different colors). With the help
of word and script features, our model reconstructs the representation distribution of different
languages to a more uniform one.
Figure 4
Performance improvement for different input sequence lengths (character-level). Our method consistently outperforms the baseline over the length buckets.
(3) Impact of Data Augmentation
In-domain training data have a crucial impact on Q-LID. We draw Figure 5 to illustrate how data augmentation contributes to Q-LID. Under our scenario, several
similar languages fail to be distinguished when the classifier is trained using out-of-
domain and unbalanced samples. For example, Malay and Indonesian are similar and
the latter lacks a language resource, resulting in a high error rate on their identification.
Additionally, German, English, and Dutch belong to the Germanic branch of the Indo-
European language family and share some vocabularies that increase the difficulty of
Q-LID. With the data augmentation, our model achieves significant improvements on these languages. This indicates the effectiveness of the proposed method.
6.2 Qualitative Analysis
Table 6 shows several identification results of the baseline and our model. We select several representative cases for analysis.
Figure 5
The prediction error rate on different languages before and after using data augmentation.
Table 6
Examples of results predicted by baseline and our model. Our model can exactly handle the
problems of polysemy, code-switching, and misspelling.
Query
masque sport
xiaomi 8 чеxoл
cosmeticos
Meaning
sport mask
xiaomi 8 case
cosmetics
Baseline Ours
en
de
en
fr
ru
pt
Label
fr
ru
pt
• In the first case, "masque" is an English and French homograph, while "sport" is a common word in both English and French. When these two words are combined, the query should be identified as a French phrase for "sport mask."

• Considering the second case, "xiaomi 8" refers to a mobile phone model and is followed by a Russian word for "case." The baseline identifies this kind of code-switching case as German (de).

• For the third case, "cosmeticos" is a misspelled Portuguese word, "cosméticos." The baseline classifies this case as English.
All of these identification errors eventually lead to recalls that are irrelevant to the user's intention. On the contrary, our model can exactly handle these problems.
7. Conclusion
In this paper, we investigate and propose several effective approaches to improve neural Q-LID from both the model architecture and data augmentation perspectives. Experimental results show that the proposed approaches not only make the Q-LID system surpass strong baselines by over 11 accuracy points, but also benefit the out-of-domain LID task. Besides, we collect an LID test set and make it publicly available, which may contribute to subsequent research in the LID and CLIR communities.
Acknowledgments
The authors thank the reviewers for their
helpful comments in improving the quality
of this work. This work is supported by
National Key R&D Program of China
(2018YFB1403202).
References
Amjad, Maaz, Grigori Sidorov, and Alisa
Zhila. 2020. Data augmentation using
machine translation for fake news
detection in the Urdu language. In
Proceedings of the 12th Language Resources
and Evaluation Conference, LREC 2020,
pages 2537–2542.
Ba, Lei Jimmy, Jamie Ryan Kiros, and
Geoffrey E. Hinton. 2016. Layer
normalization. CoRR, abs/1607.06450.
Bayer, Markus, Marc-André Kaufhold, and
Christian Reuter. 2021. A survey on data
augmentation for text classification. CoRR,
abs/2107.03158.
Beltagy, Iz, Matthew E. Peters, and Arman
Cohan. 2020. Longformer: The long-document
transformer. arXiv preprint
arXiv:2004.05150.
Bergstra, James, Rémi Bardenet, Yoshua
Bengio, and Balázs Kégl. 2011. Algorithms
for hyper-parameter optimization. In
Advances in Neural Information Processing
Systems 24: 25th Annual Conference on
Neural Information Processing Systems 2011.
pages 2546–2554.
Bestgen, Yves. 2021. Optimizing a supervised
classifier for a difficult language
identification problem. In Proceedings of the
Eighth Workshop on NLP for Similar
Languages, Varieties and Dialects,
VarDial@EACL 2021, pages 96–101.
Bornea, Mihaela A., Lin Pan, Sara Rosenthal,
Radu Florian, and Avirup Sil. 2021.
Multilingual transfer learning for QA
using translation as data augmentation. In
Thirty-Fifth AAAI Conference on Artificial
Intelligence, AAAI 2021, Thirty-Third
Conference on Innovative Applications of
Artificial Intelligence, IAAI 2021, The
Eleventh Symposium on Educational
Advances in Artificial Intelligence, EAAI
2021, pages 12583–12591.
Bosca, Alessio and Luca Dini. 2010.
Language identification strategies for cross
language information retrieval. In CLEF
2010 LABs and Workshops, Notebook Papers,
volume 1176 of CEUR Workshop
Proceedings, CEUR-WS.org.
Ceolin, Andrea. 2021. Comparing the
performance of CNNs and shallow models
for language identification. In Proceedings
of the Eighth Workshop on NLP for Similar
Languages, Varieties and Dialects,
VarDial@EACL 2021, pages 102–112.
Chakravarthi, Bharathi Raja, Mihaela
Gaman, Radu Tudor Ionescu, Heidi
Jauhiainen, Tommi Jauhiainen, Krister
Lindén, Nikola Ljubesic, Niko Partanen,
Ruba Priyadharshini, Christoph Purschke,
Rajagopal Eswari, Yves Scherrer, and
Marcos Zampieri. 2021. Findings of the
vardial evaluation campaign 2021. In
Proceedings of the Eighth Workshop on NLP
for Similar Languages, Varieties and Dialects,
VarDial@EACL 2021, pages 1–11.
Conneau, Alexis, Kartikay Khandelwal,
Naman Goyal, Vishrav Chaudhary,
Guillaume Wenzek, Francisco Guzmán,
Edouard Grave, Myle Ott, Luke
Zettlemoyer, and Veselin Stoyanov. 2020.
Unsupervised cross-lingual representation
learning at scale. In Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics, ACL 2020,
pages 8440–8451. https://doi.org/10
.18653/v1/2020.acl-main.747
Conneau, Alexis and Guillaume Lample.
2019. Cross-lingual language model
pretraining. In Advances in Neural
Information Processing Systems 32: Annual
Conference on Neural Information Processing
Systems 2019, NeurIPS 2019,
pages 7057–7067.
Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime
G. Carbonell, Quoc Viet Le, and Ruslan
Salakhutdinov. 2019. Transformer-XL:
Attentive language models beyond a
fixed-length context. In Proceedings of
the 57th Conference of the Association for
Computational Linguistics, ACL 2019,
Volume 1: Long Papers, pages 2978–2988.
https://doi.org/10.18653/v1/P19
-1285
Deng, Xuelian, Yuqing Li, Jian Weng, and
Jilian Zhang. 2019. Feature selection for
text classification: A review. Multimedia
Tools and Applications, 78(3):3797–3816.
https://doi.org/10.1007/s11042-018
-6083-5
Deshwal, Deepti, Pardeep Sangwan, and
Divya Kumar. 2019. Feature extraction
methods in language identification: A
survey. Wireless Personal Communications,
107(4):2071–2103. https://doi.org/10
.1007/s11277-019-06373-3
Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT:
Pre-training of deep bidirectional
transformers for language understanding.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, NAACL-HLT 2019, Volume 1
(Long and Short Papers), pages 4171–4186.
https://doi.org/10.18653/v1/N19
-1423
Duvenhage, Bernardt. 2019. Short text
language identification for under
resourced languages. CoRR,
abs/1911.07555.
Garla, Vijay and Cynthia Brandt. 2012.
Ontology-guided feature engineering for
clinical text classification. Journal of
Biomedical Informatics, 45(5):992–998.
https://doi.org/10.1016/j.jbi.2012
.04.010, PubMed: 22580178
Glorot, Xavier, Antoine Bordes, and Yoshua
Bengio. 2011. Domain adaptation for
large-scale sentiment classification: A deep
learning approach. In Proceedings of the
28th International Conference on Machine
Learning, ICML 2011, pages 513–520.
Godinez, Erick Velazquez, Zoltán Szlávik,
Selene Baez Santamaría, and Robert-Jan
Sips. 2020. Language identification for
short medical texts. In Proceedings of the
13th International Joint Conference on
Biomedical Engineering Systems and
Technologies (BIOSTEC 2020) - Volume 5:
HEALTHINF, pages 399–406. https://
doi.org/10.5220/0008950903990406
Guo, Maosheng, Yu Zhang, and Ting Liu.
2019. Gaussian transformer: A lightweight
approach for natural language inference.
In The Thirty-Third AAAI Conference on
Artificial Intelligence, AAAI 2019, The
Thirty-First Innovative Applications of
Artificial Intelligence Conference, IAAI 2019,
The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence, EAAI
2019, pages 6489–6496. https://doi.org
/10.1609/aaai.v33i01.33016489
He, Kaiming, Xiangyu Zhang, Shaoqing Ren,
and Jian Sun. 2016. Deep residual learning
for image recognition. In 2016 IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2016, pages 770–778.
https://doi.org/10.1109/CVPR.2016.90
Jauhiainen, Tommi, Heidi Jauhiainen, and
Krister Lindén. 2021. Naive Bayes-based
experiments in Romanian dialect
identification. In Proceedings of the Eighth
Workshop on NLP for Similar Languages,
Varieties and Dialects, VarDial@EACL 2021,
pages 76–83.
Jauhiainen, Tommi, Marco Lui, Marcos
Zampieri, Timothy Baldwin, and Krister
Lind´en. 2019. Automatic language
identification in texts: A survey. Journal of
Artificial Intelligence Research, 65:675–782.
https://doi.org/10.1613/jair.1
.11675
Joulin, Armand, Edouard Grave, Piotr
Bojanowski, and Tomás Mikolov. 2017. Bag
of tricks for efficient text classification. In
Proceedings of the 15th Conference of the
European Chapter of the Association for
Computational Linguistics, EACL 2017,
Volume 2: Short Papers, pages 427–431.
https://doi.org/10.18653/v1/E17-2068
Kim, Yoon. 2014. Convolutional neural
networks for sentence classification. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2014, A meeting of
SIGDAT, a Special Interest Group of the ACL,
pages 1746–1751. https://doi.org/10
.3115/v1/D14-1181
Kocmi, Tom and Ondrej Bojar. 2017.
LanideNN: Multilingual language
identification on character window. CoRR,
abs/1701.03338. https://doi.org/10
.18653/v1/E17-1087
Koehn, Philipp. 2004. Statistical significance
tests for machine translation evaluation.
In Proceedings of the 2004 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2004, A meeting of
SIGDAT, a Special Interest Group of the ACL,
held in conjunction with ACL 2004,
pages 388–395.
Li, Jian, Zhaopeng Tu, Baosong Yang,
Michael R. Lyu, and Tong Zhang. 2018.
Multi-head attention with disagreement
regularization. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2897–2903.
https://doi.org/10.18653/v1/D18-1317
Li, Juntao, Chang Liu, Jian Wang, Lidong
Bing, Hongsong Li, Xiaozhong Liu,
Dongyan Zhao, and Rui Yan. 2020.
Cross-lingual low-resource
set-to-description retrieval for global
e-commerce. In The Thirty-Fourth AAAI
Conference on Artificial Intelligence, AAAI
2020, The Thirty-Second Innovative
Applications of Artificial Intelligence
Conference, IAAI 2020, The Tenth AAAI
Symposium on Educational Advances in
Artificial Intelligence, EAAI 2020,
pages 8212–8219, AAAI Press. https://
doi.org/10.1609/aaai.v34i05.6335
Lui, Marco and Timothy Baldwin. 2012.
langid.py: An off-the-shelf language
identification tool. In The 50th Annual
Meeting of the Association for Computational
Linguistics, Proceedings of the System
Demonstrations, pages 25–30.
Lui, Marco, Jey Han Lau, and Timothy
Baldwin. 2014. Automatic detection and
language identification of multilingual
documents. Transactions of the Association
for Computational Linguistics, 2:27–40.
https://doi.org/10.1162/tacl_a_00163
Mager, Manuel, Özlem Çetinoglu, and
Katharina Kann. 2019. Subword-level
language identification for intra-word
code-switching. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2019, Volume 1 (Long and Short Papers),
pages 2005–2011. https://doi.org/10
.18653/v1/N19-1201
Majlis, Martin and Zdenek Zabokrtský. 2012.
Language richness of the web. In
Proceedings of the Eighth International
Conference on Language Resources and
Evaluation, LREC 2012, pages 2927–2934.
Mandal, Soumil and Anil Kumar Singh. 2018.
Language identification in code-mixed
data using multichannel neural networks
and context capture. In Proceedings of the
4th Workshop on Noisy User-generated Text,
NUT@EMNLP 2018, pages 116–120.
https://doi.org/10.18653/v1/W18-6116
Qi, Zhaodi, Yong Ma, and Mingliang Gu.
2019. A study on low-resource language
identification. In 2019 Asia-Pacific Signal
and Information Processing Association
Annual Summit and Conference, APSIPA
ASC 2019, pages 1897–1902, IEEE.
https://doi.org/10.1109
/APSIPAASC47483.2019.9023075
Ren, Xingzhang, Baosong Yang, Dayiheng
Liu, Haibo Zhang, Xiaoyu Lv, Liang Yao,
and Jun Xie. 2022. Unsupervised
preference-aware language identification.
In Findings of the Association for
Computational Linguistics: ACL 2022,
pages 3847–3852. https://doi.org/10
.18653/v1/2022.findings-acl.303
Sabet, Ali, Prakhar Gupta, Jean-Baptiste
Cordonnier, Robert West, and Martin
Jaggi. 2019. Robust cross-lingual
embeddings from parallel sentences.
CoRR, abs/1912.12481.
Schäfer, Roland. 2016. CommonCOW:
Massively huge web corpora from
CommonCrawl data and a method to
distribute them freely under restrictive EU
copyright laws. In Proceedings of the Tenth
International Conference on Language
Resources and Evaluation LREC 2016.
Strubell, Emma, Patrick Verga, Daniel Andor,
David Weiss, and Andrew McCallum.
2018. Linguistically-informed
self-attention for semantic role labeling.
In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language
Processing, pages 5027–5038. https://doi
.org/10.18653/v1/D18-1548
Sun, Shuo, Suzanna Sia, and Kevin Duh.
2020. Clireval: Evaluating machine
translation as a cross-lingual information
retrieval task. In Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics: System
Demonstrations, ACL 2020, pages 134–141.
https://doi.org/10.18653/v1/2020
.acl-demos.18
Tambi, Ritiz, Ajinkya Kale, and Tracy
Holloway King. 2020. Search query
language identification using weak
labeling. In Proceedings of the 12th Language
Resources and Evaluation Conference, LREC
2020, pages 3520–3527.
Tiedemann, Jörg and Santhosh Thottingal.
2020. OPUS-MT: Building open translation
services for the world. In Proceedings of the
22nd Annual Conference of the European
Association for Machine Translation, EAMT
2020, pages 479–480.
Vaswani, Ashish, Noam Shazeer, Niki
Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you
need. In Advances in Neural Information
Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017,
pages 5998–6008.
Vo, Duy-Tin and Richard Khoury. 2019.
Language identification on massive
datasets of short message using an
attention mechanism CNN. CoRR,
abs/1910.06748. https://doi.org/10
.1109/ASONAM49781.2020.9381393
Wan, Yu, Baosong Yang, Derek F. Wong,
Lidia S. Chao, Haihua Du, and Ben C. H.
Ao. 2020. Unsupervised neural dialect
translation with commonality and
diversity modeling. In Proceedings of the
AAAI Conference on Artificial Intelligence,
pages 9130–9137. https://doi.org/10
.1609/aaai.v34i05.6448
Wan, Yu, Baosong Yang, Derek Fai Wong,
Lidia Sam Chao, Liang Yao, Haibo Zhang,
and Boxing Chen. 2022. Challenges of
neural machine translation for short texts.
Computational Linguistics, 48(2):321–342.
https://doi.org/10.1162/coli_a_00435
Wei, Jason W. and Kai Zou. 2019. EDA: Easy
data augmentation techniques for boosting
performance on text classification tasks.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language
Processing, EMNLP-IJCNLP 2019,
pages 6381–6387. https://doi.org/10
.18653/v1/D19-1670
Wu, Nianheng, Eric DeMattos, Kwok Him
So, Pin-zhen Chen, and Çağrı Çöltekin.
2019. Language discrimination and
transfer learning for similar languages:
Experiments with feature combinations
and adaptation. In Proceedings of the Sixth
Workshop on NLP for Similar Languages,
Varieties and Dialects, pages 54–63.
https://doi.org/10.18653/v1/W19
-1406
Wu, Yonghui, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, Yuan
Cao, Qin Gao, Klaus Macherey, Jeff
Klingner, Apurva Shah, Melvin Johnson,
Xiaobing Liu, Lukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo,
Hideto Kazawa, Keith Stevens, George
Kurian, Nishant Patil, Wei Wang, Cliff
Young, Jason Smith, Jason Riesa, Alex
Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2016.
Google’s neural machine translation
system: Bridging the gap between
human and machine translation. CoRR,
abs/1609.08144.
Xu, Mingzhou, Derek F. Wong, Baosong
Yang, Yue Zhang, and Lidia S. Chao. 2019.
Leveraging local and global patterns for
self-attention networks. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 3069–3075.
https://doi.org/10.18653/v1/P19-1295
Xu, Mingzhou, Baosong Yang, Derek F.
Wong, and Lidia S. Chao. 2022. Multi-view
self-attention networks. Knowledge-Based
Systems, 241:108268. https://doi.org/10
.1016/j.knosys.2022.108268
Yang, Baosong, Longyue Wang, Derek F.
Wong, Lidia S. Chao, and Zhaopeng Tu.
2019. Convolutional self-attention
networks. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2019, Volume 1 (Long and Short Papers),
pages 4040–4045. https://doi.org/10
.18653/v1/N19-1407
Yao, Liang, Baosong Yang, Haibo Zhang,
Boxing Chen, and Weihua Luo. 2020.
Domain transfer based data augmentation
for neural query translation. In Proceedings
of the 28th International Conference on
Computational Linguistics. https://doi
.org/10.18653/v1/2020.coling-main
.399
Zhang, Yuan, Jason Riesa, Daniel Gillick,
Anton Bakalov, Jason Baldridge, and
David Weiss. 2018. A fast, compact,
accurate model for language identification
of codemixed text. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 328–337.
https://doi.org/10.18653/v1/D18
-1030