Automatic Adaptation of Annotations
Wenbin Jiang∗
Chinese Academy of Sciences
Yajuan Lü∗
Chinese Academy of Sciences
Liang Huang∗∗
Queens College and Graduate Center,
The City University of New York
Qun Liu∗†
Dublin City University
Chinese Academy of Sciences
Manually annotated corpora are indispensable resources, yet for many annotation tasks, such as
the creation of treebanks, there exist multiple corpora with different and incompatible annotation
guidelines. This leads to an inefficient use of human expertise, but it could be remedied by
integrating knowledge across corpora with different annotation guidelines. In this article we
describe the problem of annotation adaptation and the intrinsic principles of the solutions, and
present a series of successively enhanced models that can automatically adapt the divergence
between different annotation formats.
We evaluate our algorithms on the tasks of Chinese word segmentation and dependency
parsing. For word segmentation, where there are no universal segmentation guidelines because
of the lack of morphology in Chinese, we perform annotation adaptation from the much
larger People’s Daily corpus to the smaller but more popular Penn Chinese Treebank. For
dependency parsing, we perform annotation adaptation from the Penn Chinese Treebank to
a semantics-oriented Dependency Treebank, which is annotated using significantly different
annotation guidelines. In both experiments, automatic annotation adaptation brings significant
improvements, achieving state-of-the-art performance despite the use of purely local features in
training.
∗ Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese
Academy of Sciences, No. 6 Kexueyuan South Road, Haidian District, P.O. Box 2704, Beijing 100190,
China. E-mail: {jiangwenbin, liuqun, lvyajuan}@ict.ac.cn.
∗∗ Department of Computer Science, Queens College / CUNY, 65-30 Kissena Blvd., Queens, NY 11367.
E-mail: liang.huang.sh@gmail.com.
† Centre for Next Generation Localisation, Faculty of Engineering and Computing, Dublin City University.
E-mail: qliu@computing.dcu.ie.
Submission received: 24 April 2013; revised version received: 6 March 2014; accepted for publication:
18 April 2014.
doi:10.1162/COLI_a_00210
© 2015 Association for Computational Linguistics
1. Introduction
Much of statistical NLP research relies on some sort of manually annotated corpora
to train models, but annotated resources are extremely expensive to build, especially
on a large scale. The creation of treebanks is a prime example (Marcus, Santorini,
and Marcinkiewicz 1993). However, the linguistic theories motivating these annotation
efforts are often heavily debated, and as a result there often exist multiple corpora
for the same task with vastly different and incompatible annotation philosophies. For
example, there are several treebanks for English, including the Chomskian-style Penn
Treebank (Marcus, Santorini, and Marcinkiewicz 1993), the HPSG LinGo Redwoods
Treebank (Oepen et al. 2002), and a smaller dependency treebank (Buchholz and Marsi
2006). From the perspective of resource accumulation, this seems a waste of human
effort.1
A second, related problem is that the raw texts are also drawn from different
domains, which for the above example range from financial news (Penn
Treebank/Wall Street Journal) to transcribed dialog (LinGo). It would be nice if a
system could be automatically ported from one set of guidelines and/or domain to
the other, in order to exploit a much larger data set. The second problem, domain
adaptation, is very well studied (e.g., Blitzer, McDonald, and Pereira 2006; Daumé III
2007). This work focuses on the widely existing and equally important problem of
annotation adaptation, which adapts the divergence between different annotation
guidelines and integrates linguistic knowledge in corpora with incongruent annotation
formats.
In this article, we describe the problem of annotation adaptation and the intrinsic
principles of the solutions, and present a series of successively improved concrete
models, the goal being to transfer the annotations of a corpus (source corpus) to the
annotation format of another corpus (target corpus). The transfer classifier is the fun-
damental component of the annotation adaptation algorithms. It learns correspondence
regularities between annotation guidelines from a parallel annotated corpus, which has
two kinds of annotations for the same data. In the simplest model (Model 1), the source
classifier trained on the source corpus gives its predictions to the transfer classifier
trained on the parallel annotated corpus, so as to integrate the knowledge in the two
corpora. In a variant of the simplest model (Model 2), the transfer classifier is used to
transform the annotations in the source corpus into the annotation format of the target
corpus; then the transformed source corpus and the target corpus are merged in order
to train a more accurate classifier. Based on the second model, we finally develop an
optimized model (Model 3), where two optimization strategies, iterative training and
predict-self re-estimation, are integrated to further improve the efficiency of annotation
adaptation.
We experiment on Chinese word segmentation and dependency parsing to test
the efficacy of our methods. For word segmentation, the problem of incompatible
annotation guidelines is one of the most glaring: No segmentation guideline has been
widely accepted due to the lack of a clear definition of Chinese word morphology.
For dependency parsing there also exist multiple disparate annotation guidelines. For
1 Different annotated corpora for the same task facilitate the comparison of linguistic theories. From this
point of view, having multiple standards is not necessarily a waste but rather a blessing, because it is a
necessary phase in coming to a consensus if there is one.
[Figure 1 content: the example sentence 美副总统访华 segmented and POS tagged under the CTB guideline (tags NR, NN, VV; gloss “U.S. Vice-President visited China”) and under the PD guideline (tags ns, b, n, v; gloss “U.S. Vice- President visited-China”).]
Figure 1
Incompatible word segmentation and POS tagging guidelines between CTB (upper) and PD
(below).
example, the dependency relations extracted from a constituency treebank follow syn-
tactic principles, whereas the semantic dependency treebank is annotated from a semantic
perspective.
The two corpora for word segmentation are the much larger People’s Daily cor-
pus (PD) (5.86M words) (Yu et al. 2001) and the smaller but more popular Penn
Chinese Treebank (CTB) (0.47M words) (Xue et al. 2005). They utilize very differ-
ent segmentation guidelines; for example, as shown in Figure 1, PD breaks Vice-
President into two words and combines the phrase visited-China into a compound,
compared with the segmentation following the CTB annotation guideline. It is prefer-
able to transfer knowledge from PD to CTB because the latter also annotates tree
structures, which are useful for downstream applications like parsing, summariza-
tion, and machine translation, yet it is much smaller in size. For dependency pars-
ing, we use the dependency treebank (DCTB) extracted from CTB according to the
rules of Yamada and Matsumoto (2003), and the Semantic Dependency Treebank
(SDT) built on a small part of the CTB text (Che et al. 2012). Compared with the
automatically extracted dependencies in DCTB, semantic dependencies in SDT re-
veal semantic relationships between words, rather than the syntactic relationships in
syntactic dependencies. Figure 2 shows an example. Experiments on both word seg-
mentation and dependency parsing show that annotation adaptation results in signifi-
cant improvement over the baselines, and achieves state-of-the-art performance with only local
features.
The rest of this article is organized as follows. Section 2 gives a description
of the problem of annotation adaptation. Section 3 briefly introduces the tasks
of word segmentation and dependency parsing as well as their state-of-the-art
models. In Section 4 we first describe the transfer classifier that reveals the
Figure 2
Incompatible dependency guidelines between SDT (top) and DCTB (bottom).
intrinsic principles of annotation adaptation, and then depict the three successively
enhanced models for automatic adaptation of annotations. After the description of
experimental results in Section 5 and the discussion of application scenarios in Sec-
tion 6, we give a brief review of related work in Section 7, drawing conclusions in
Section 8.
2. Automatic Annotation Adaptation
We define annotation adaptation as a task aimed at automatically adapting the
divergence between different annotation guidelines. Statistical models can be designed
to learn the relevance of two annotation guidelines in order to transform a corpus
from one annotation guideline to another. From this point of
看法, annotation adaptation can be seen as a special case of transfer learning.
Through annotation adaptation, the linguistic knowledge in different corpora is
integrated, resulting in enhanced NLP systems without complicated models and
features.
Much research has considered the problem of domain adaptation (Blitzer,
McDonald, and Pereira 2006; Daumé III 2007), which also can be seen as a special case
of transfer learning. It aims to adapt models trained in one domain (例如, 化学) 到
work well in other domains (例如, medicine). Despite superficial similarities between
domain adaptation and annotation adaptation, we argue that the underlying problems
are quite different. Domain adaptation assumes that the labeling guidelines are
preserved between the two domains—for example, an adjective is always labeled
as JJ regardless of whether it is from the Wall Street Journal (WSJ) or a biomedical
文本, and only the distributions are different—for example, the word control is most
likely a verb in WSJ but often a noun in biomedical texts (as in control experiment).
Annotation adaptation, 然而, tackles the problem where the guideline itself
is changed, 例如, one treebank might distinguish between transitive and
intransitive verbs, while merging the different noun types (NN, NNS, etc.), or one
treebank (PTB) might be much flatter than the other (LinGo), not to mention the
fundamental disparities between their underlying linguistic representations (CFG vs.
HPSG).
A more formal description will allow us to make these claims more precise. Let X
be the data and Y be the annotation. Annotation adaptation can be understood as a
change of P(Y) due to the change in annotation guidelines while P(X) remains constant.
Through annotation adaptation, we want to change the annotations of the data from one
guideline to another, leaving the data itself unchanged. 然而, in domain adaptation,
P(X) changes, but P(Y) is assumed to be constant. The word assumed means that the
distributions P(Y, X) and P(Y|X) are actually changed because P(X) is changed. Domain
adaptation aims to make the model better adapt to a different domain with the same
annotation guidelines.
According to this analysis, annotation adaptation seems more motivated from a
linguistic (rather than statistical) point of view, and tackles a serious problem fun-
damentally different from domain adaptation, which is also a serious problem (often
leading to >10% loss in accuracy). More interestingly, annotation adaptation, without
any assumptions about distributions, can be simultaneously applied to both domain
and annotation adaptation problems, which is very appealing in practice because the
latter problem often implies the former.
3. Case Studies: Word Segmentation and Dependency Parsing
3.1 Word Segmentation and Character Classification Method
In many Asian languages there are no explicit word boundaries, thus word segmen-
tation is a fundamental task for the processing and understanding of these languages.
Given a sentence as a sequence of n characters:
x = x1 x2 .. xn
where xi is a character, word segmentation aims to split the sequence into m(≤ n)
字:
x1:e1 xe1+1:e2 .. xem−1+1:em
where each subsequence xi:j indicates a Chinese word spanning from characters xi to xj.
Word segmentation can be formalized as a sequence labeling problem (Xue and
沉 2003), where each character in the sentence is given a boundary tag representing
its position in a word. Following Ng and Low (2004), joint word segmentation and part-
of-speech (销售点) tagging can also be solved using a character classification approach by
extending boundary tags to include POS information. For word segmentation we adopt
the four boundary tags of Ng and Low (2004), B, M, E, and S, where B, M, and E mean
the beginning, the middle, and the end of a word, 分别, and S indicates a single-
character word. The word segmentation result can be generated by splitting the labeled
character sequence into subsequences of pattern S or BM∗E, indicating single-character
words or multi-character words, respectively.
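To make the labeling scheme concrete, the following Python sketch (ours, not part of the original article; the function name and the greedy handling of ill-formed tag sequences are assumptions) recovers a word sequence from a character sequence and its B/M/E/S tags.

```python
def tags_to_words(chars, tags):
    """Recover a word sequence from characters labeled with B/M/E/S tags.

    Ill-formed tag sequences (e.g., an M without a preceding B) are closed
    greedily, which is one common convention, not necessarily the authors'.
    """
    words, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":                       # single-character word
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)
        elif tag == "B":                     # begin a multi-character word
            if current:
                words.append("".join(current))
            current = [ch]
        elif tag == "M":                     # continue the current word
            current.append(ch)
        else:                                # "E": end the current word
            current.append(ch)
            words.append("".join(current))
            current = []
    if current:                              # flush a dangling word
        words.append("".join(current))
    return words

# e.g. tags_to_words(list("美副总统访华"), ["S", "B", "M", "E", "S", "S"])
# -> ["美", "副总统", "访", "华"]
```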
Given the character sequence x, the decoder finds the output ˜y that maximizes the
score function:
˜y = argmax_y f(x, y) · w = argmax_y Σ_{xi∈x, yi∈y} f(xi, yi) · w    (1)
where function f maps (x, y) into a feature vector, w is the parameter vector generated
by the training algorithm, and f(X, y) · w is the inner product of f(X, y) and w. 这
score of the sentence is further factorized into each character, where yi is the character
classification label of character xi.
The training procedure of perceptron learns a discriminative model mapping from
the inputs x to the outputs y. Algorithm 1 shows the perceptron algorithm for tuning the
parameter w. The “averaged parameters” technology (柯林斯 2002) is used for better
表现. The feature templates of the classifier is shown in Table 1. The function
Pu(·) returns true for a punctuation character and false for others; the function T(·) clas-
sifies a character into four types: 数字, 日期, English letter, 和别的, 相应的
to function values 1, 2, 3, 和 4, respectively.
3.2 Dependency Parsing and Spanning Tree Method
Dependency parsing aims to link each word to its arguments so as to form a directed
graph spanning the whole sentence. 通常情况下, the directed graph is restricted to a
Algorithm 1 Perceptron training algorithm.
 1: Input: Training set C
 2: w ← 0
 3: for t ← 1 .. T do                         ⊲ T iterations
 4:   for (x, y) ∈ C do
 5:     ˜z ← argmax_z f(x, z) · w
 6:     if ˜z ≠ y then
 7:       w ← w + f(x, y) − f(x, ˜z)          ⊲ update the parameters
 8:     end if
 9:   end for
10: end for
11: Output: Parameters w
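A minimal Python sketch of Algorithm 1 with parameter averaging is given below. The sparse dictionary representation of features, the pluggable decode and extract functions, and the averaging bookkeeping are our own simplifications; in the article the decoder searches over whole label sequences rather than isolated labels.

```python
from collections import defaultdict

def perceptron_train(data, decode, extract, T=10):
    """Structured perceptron (cf. Algorithm 1) with parameter averaging.

    data    : list of (x, y) training pairs
    decode  : function (x, w) -> highest-scoring output under weights w
    extract : function (x, y) -> feature counts as a dict {feature: count}
    """
    w = defaultdict(float)          # current weight vector
    total = defaultdict(float)      # running sum of weight vectors, for averaging
    steps = 0
    for _ in range(T):              # T iterations over the training set
        for x, y in data:
            z = decode(x, w)
            if z != y:              # update the parameters
                for f, v in extract(x, y).items():
                    w[f] += v
                for f, v in extract(x, z).items():
                    w[f] -= v
            for f, v in w.items():
                total[f] += v
            steps += 1
    # averaged parameters (Collins 2002)
    return {f: v / steps for f, v in total.items()}
```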
dependency tree where each word depends on exactly one parent, and all words find
their parents. Given a sentence as a sequence of n words:
x = x1 x2 .. xn
dependency parsing finds a dependency tree y, where (i, j) ∈ y is an edge from the head
word xi to the modifier word xj. The root r ∈ x in the tree y has no head word, and each
of the other words j (j ∈ x and j ≠ r) depends on a head word i (i ∈ x and i ≠ j).
For many languages, the dependency structures are supposed to be projective. 如果
xj is dependent on xi, then all the words between i and j must be directly or indirectly
dependent on xi. 所以, if we put the words in their linear order, preceded by the
root, all edges can be drawn above the words without crossing. We follow this constraint
because the dependency treebanks in our experiments are projective.
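The projectivity constraint amounts to the non-crossing condition just described, and it can be checked directly on a head array; the sketch below (our own formulation, with a 1-based head array and 0 standing for the artificial root placed before the sentence) is one way to do it.

```python
def is_projective(heads):
    """Check whether a dependency tree can be drawn above the sentence
    without crossing edges.

    heads[j] is the 1-based index of the head of word j+1, with 0 standing
    for the artificial root placed before the sentence.
    """
    edges = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, (l1, r1) in enumerate(edges):
        for l2, r2 in edges[a + 1:]:
            # two edges cross iff exactly one endpoint of each lies strictly
            # inside the span of the other
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# e.g. is_projective([2, 0, 2]) -> True; is_projective([3, 4, 0, 3]) -> False
```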
Following the edge-based factorization method (艾斯纳 1996), the score of a de-
pendency tree can be factorized into the dependency edges in the tree. The spanning
Table 1
Feature templates and instances for character classification-based word segmentation model.
C0 denotes the current character, and C−i/Ci denote the ith character to the left/right of C0.
Suppose we are considering the third character “总” in “美-副-总-统-访-华”.
Type        Template      Instance
n-gram      C−2           C−2=美
            C−1           C−1=副
            C0            C0=总
            C1            C1=统
            C2            C2=访
            C−2C−1        C−2C−1=美副
            C−1C0         C−1C0=副总
            C0C1          C0C1=总统
            C1C2          C1C2=统访
            C−1C1         C−1C1=副统
function    Pu(C0)        Pu(C0)=true
            T(C−2:2)      T(C−2:2)=44444
tree method (麦当劳, Crammer, 和佩雷拉 2005) factorizes the score of the tree as
the sum of the scores of all its edges, and the score of an edge is defined as the inner
product of the feature vector f and the weight vector w. Given a sentence x, the parsing
procedure searches for the candidate dependency tree with the maximum score:
˜y = argmax_y f(x, y) · w = argmax_y Σ_{(i,j)∈y} f(i, j) · w    (2)
The averaged perceptron algorithm is used again to train the parameter vector. A
bottom–up dynamic programming algorithm is designed to search for the candidate
parse with the maximum score as shown in Algorithm 2, where V[我, j] contains the
candidate dependency fragments of the span [我, j]. The feature templates are similar
to those of the first-ordered MST model (麦当劳, Crammer, 和佩雷拉 2005). 每个
feature is composed of some words and POS tags surround word i and/or word j, 作为
well as an optional distance representation between the two words. 桌子 2 shows the
feature templates without distance representations.
4. Models for Automatic Annotation Adaptation
在这个部分, we present a series of discriminative learning algorithms for the auto-
matic adaptation of annotation guidelines. To facilitate the description of the algorithms,
several shortened forms are adopted. We use source
corpus to denote the corpus with the annotation guideline that we do not require, 哪个
is of course the source side of adaptation, and target corpus denotes the corpus with
the desired guideline. 相应地, the annotation guidelines of the two corpora
are denoted as source guideline and target guideline, and the classifiers following the
two annotation guidelines are respectively named as source classifier and target clas-
sifier. Given a parallel annotated corpus, 那是, a corpus labeled with two annotation
Algorithm 2 Dependency parsing algorithm.
 1: Input: sentence x to be parsed
 2: for [i, j] ⊆ [1, |x|] in topological order do
 3:   buf ← ∅
 4:   for k ← i..j − 1 do                      ⊲ all partitions
 5:     for l ∈ V[i, k] and r ∈ V[k + 1, j] do
 6:       insert DERIV(l, r) into buf
 7:       insert DERIV(r, l) into buf
 8:     end for
 9:   end for
10:   V[i, j] ← best K in buf
11: end for
12: Output: the best of V[1, |x|]
13: function DERIV(p, c)
14:   return p ∪ c ∪ {(p · root, c · root)}     ⊲ new derivation
15: end function
桌子 2
Feature templates and instances for dependency parsing. Wi/Pi denotes the word/POS of the
hypothetic head, Wj/Pj denotes the word/POS of the hypothetic modifier, and Px+/−1 denotes
the POS to the right/left of the x-th token. Suppose we are considering the words “外” (i = 2)
and “开放” (j = 3) in “中国/NR-对/P-外/NN-开放/VV-成绩/NN-斐然/VV”.
Type        Template          Instance
unigram     WiPi              WiPi=外◦NN
            Wi                Wi=外
            Pi                Pi=NN
            WjPj              WjPj=开放◦VV
            Wj                Wj=开放
            Pj                Pj=VV
bigram      WiPiWjPj          WiPiWjPj=外◦NN◦开放◦VV
            WiWjPj            WiWjPj=外◦开放◦VV
            PiWjPj            PiWjPj=NN◦开放◦VV
            WiPiWj            WiPiWj=外◦NN◦开放
            WiPiPj            WiPiPj=外◦NN◦VV
            WiWj              WiWj=外◦开放
            WiPj              WiPj=外◦VV
            PiWj              PiWj=NN◦开放
            PiPj              PiPj=NN◦VV
context     PiPi+1Pj−1Pj      PiPi+1Pj−1Pj=NN◦VV◦NN◦VV
            Pi−1PiPj−1Pj      Pi−1PiPj−1Pj=P◦NN◦NN◦VV
            PiPi+1PjPj+1      PiPi+1PjPj+1=NN◦VV◦VV◦NN
            Pi−1PiPjPj+1      Pi−1PiPjPj+1=P◦NN◦VV◦NN
            Pi−1PiPj−1        Pi−1PiPj−1=P◦NN◦NN
            Pi−1PiPj+1        Pi−1PiPj+1=P◦NN◦NN
            PiPi+1Pj−1        PiPi+1Pj−1=NN◦VV◦NN
            PiPi+1Pj+1        PiPi+1Pj+1=NN◦VV◦NN
            Pi−1Pj−1Pj        Pi−1Pj−1Pj=P◦NN◦VV
            Pi−1PjPj+1        Pi−1PjPj+1=P◦VV◦NN
            Pi+1Pj−1Pj        Pi+1Pj−1Pj=VV◦NN◦VV
            Pi+1PjPj+1        Pi+1PjPj+1=VV◦VV◦NN
            PiPj−1Pj          PiPj−1Pj=NN◦NN◦VV
            PiPjPj+1          PiPjPj+1=NN◦VV◦NN
            Pi−1PiPj          Pi−1PiPj=P◦NN◦VV
            PiPi+1Pj          PiPi+1Pj=NN◦VV◦VV
guidelines, a transfer classifier can be trained to capture the regularity of transforma-
tion from the source annotation to the target annotation. The classifiers mentioned
here are normal discriminative classifiers that take a set of features as input and give
a classification label as output. For the POS tagging problem, the classification label
is a POS tag, and for the parsing task, the classification label is a dependency edge, A
constituency span, or a shift-reduce action.
The parallel annotated corpus is the knowledge source of annotation adaptation.
The annotation quality and data size of the parallel annotated corpus determine the
accuracy of the transfer classifier. Such a corpus is difficult to build manually, 虽然
we can generate a noisy one automatically from the source corpus and the target corpus.
例如, we can apply the source classifier on the target corpus, thus generating
a parallel annotated corpus with noisy source annotations and accurate target anno-
tations. The training procedure of the transfer classifier predicts the target annotations
with guiding features extracted from the source annotations. This approach can alleviate
the effect of the noise in source annotations, and learn annotation adaptation regularities
accurately. By reducing the noise in the automatically generated parallel annotated
语料库, a higher accuracy of annotation adaptation can be achieved.
在以下部分中, we first describe the transfer classifier that reveals the
intrinsic principles of annotation adaptation, then describe a series of successively
enhanced models that are developed from our previous investigation (Jiang, 黄,
and Liu 2009; Jiang et al. 2012). In the simplest model (Model 1), two classifiers, a source
classifier and a transfer classifier, are used in a cascade. The classification results of the
lower source classifier provide additional guiding features to the upper transfer classi-
fier, yielding an improved classification result. A variant of the first model (模型 2)
uses the transfer classifier to transform the source corpus from source guideline to
target guideline first, then merges the transformed source corpus into the target corpus
in order to train an improved target classifier on the enlarged corpus. An optimized
模型 (模型 3) is further proposed based on Model 2. Two optimization strategies,
iterative training and predict-self re-estimation, are adopted to improve the efficiency
of annotation adaptation, in order to fully utilize the knowledge in heterogeneous
语料库.
4.1 Transfer Classifier
In order to learn the regularity of the adaptation from one annotation guideline to
其他, a parallel annotated corpus is needed to train the transfer classifier. The parallel
annotated corpus is a corpus with two different annotation guidelines, the source guide-
line and the target guideline. With the target annotation labels as learning objectives
and the source annotation labels as guiding information, the transfer classifier learns
the statistical regularity of the adaptation from the source annotations to the target
注释.
The training procedure of the transfer classifier is analogous to the training of a
normal classifier except for the introduction of additional guiding features. For word
segmentation, the most intuitive guiding feature is the source annotation label itself. 为了
dependency parsing, an effective guiding feature is the dependency path between the
hypothetic head and modifier, 如图 3. 然而, our effort is not limited
to this, and more special features are introduced: A classification label or dependency
path is attached to each feature of the baseline classifier to generate combined guiding
特征. This is similar to the feature design in discriminative dependency parsing
(麦当劳, Crammer, 和佩雷拉 2005; McDonald and Pereira 2006), where the basic
特征, composed of words and POSs in the context, are also conjoined with link
direction and distance in order to generate more special features.
桌子 3 shows an example of guide features (as well as baseline features) for word
segmentation, where “α = B” indicates that the source classification label of the current
character is B, demarcating the beginning of a word. The combination strategy derives
a series of specific features, helping the transfer classifier to produce more precise clas-
sifications. The parameter-tuning procedure of the transfer classifier will automatically
learn the regularity of using the source annotations to guide the classification decision.
In decoding, if the current character shares some basic features with the instances in Table 3
and is classified as B under the source annotation guideline, then the transfer classifier will
probably classify it as M.
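In code, the guide features of Table 3 are simply the baseline features conjoined with the source-side label, as in this sketch (the function name and feature-string conventions are ours).

```python
def transfer_features(base_feats, source_label):
    """Build the transfer classifier's feature set (cf. Table 3): the original
    baseline features plus each of them conjoined with the label predicted
    under the source guideline (denoted alpha in the article)."""
    guided = ["alpha=%s" % source_label]                          # the label itself
    guided += ["alpha=%s|%s" % (source_label, f) for f in base_feats]
    return base_feats + guided
```

For instance, transfer_features(ngram_features(chars, i), "B") would yield both the original n-gram features and their α=B conjunctions.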
数字 3
The guiding feature for dependency parsing, where α(我, j) denotes the dependency path between
i and j. 在这个例子中, j is a son of the great-grandfather of i.
此外, the original features used in the normal classifier are also used in order
to leverage the knowledge from the target annotation of the parallel annotated corpus,
and the training procedure of the transfer classifier also learns the relative weights
between the guiding features and the original features. Therefore, the knowledge from
both the source annotation and the target annotation is automatically integrated,
and higher and more stable prediction accuracy can be achieved.
桌子 3
Feature templates and corresponding instances for the transfer classifier. Suppose we are
considering the third character “总” in “美-副-总-统-访-华”. α = B indicates that the classification
label given by the source classifier is B.
Type      Template           Instance
Original  C−2                C−2=美
          C−1                C−1=副
          C0                 C0=总
          C1                 C1=统
          C2                 C2=访
          C−2C−1             C−2C−1=美副
          C−1C0              C−1C0=副总
          C0C1               C0C1=总统
          C1C2               C1C2=统访
          C−1C1              C−1C1=副统
          Pu(C0)             Pu(C0)=true
          T(C−2:2)           T(C−2:2)=44444
Guiding   α                  α=B
          α ◦ C−2            α=B ◦ C−2=美
          α ◦ C−1            α=B ◦ C−1=副
          α ◦ C0             α=B ◦ C0=总
          α ◦ C1             α=B ◦ C1=统
          α ◦ C2             α=B ◦ C2=访
          α ◦ C−2C−1         α=B ◦ C−2C−1=美副
          α ◦ C−1C0          α=B ◦ C−1C0=副总
          α ◦ C0C1           α=B ◦ C0C1=总统
          α ◦ C1C2           α=B ◦ C1C2=统访
          α ◦ C−1C1          α=B ◦ C−1C1=副统
          α ◦ Pu(C0)         α=B ◦ Pu(C0)=true
          α ◦ T(C−2:2)       α=B ◦ T(C−2:2)=44444
[Figure 4 content: the source corpus is used to train the source classifier with normal features; the source classifier relabels the target corpus to produce a transformed target corpus; this parallel annotated corpus is then used to train the transfer classifier with guiding features.]
数字 4
The pipeline for training of Model 1.
4.2 模型 1
The most intuitive model for annotation adaptation uses two cascaded classifiers, 这
source classifier and the transfer classifier, to integrate the knowledge in corpora with
different annotation guidelines. In the training procedure, a source classifier is trained
on the source corpus and is used to process the target corpus, generating a parallel
annotated corpus (albeit a noisy one). 然后, the transfer classifier is trained on the
parallel annotated corpus, with the target annotations as the classification labels, and
the source annotation as guiding information. 数字 4 depicts the training pipeline. 这
best training iterations for the source classifier and the transfer classifier are determined
on the development sets of the source corpus and target corpus.
In the decoding procedure, a sequence of characters (for word segmentation) 或者
字 (for dependency parsing) is input into the source classifier to obtain a classifica-
tion result under the source guideline; then it is input into the transfer classifier with
this classification result as the guiding information to get the final result following the
target guideline. This coincides with the stacking method for combining dependency
parsers (Martins et al. 2008; Nivre and McDonald 2008), and is also similar to the Pred
baseline for domain adaptation (Daumé III and Marcu 2006; Daumé III 2007).
Figure 5 shows the pipeline for decoding.
[Figure 5 content: raw sentence → source classifier → result with source guideline → transfer classifier → result with target guideline.]
数字 5
The pipeline for decoding of Model 1.
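In code, the decoding cascade of Figure 5 is just two prediction passes; the sketch below assumes generic predict interfaces for the two classifiers (these names are ours, not from the article).

```python
def model1_decode(x, source_clf, transfer_clf):
    """Model 1 decoding: run the source classifier first, then feed its
    output to the transfer classifier as guiding information."""
    source_result = source_clf.predict(x)                  # source-guideline analysis
    return transfer_clf.predict(x, guide=source_result)    # target-guideline analysis
```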
4.3 模型 2
The previous model has a drawback: It has to cascade two classifiers in decoding to
integrate the knowledge in two corpora, which seriously degrades the processing speed.
Here we describe a variant of the previous model, aiming at automatic transforma-
的 (rather than integration as in Model 1) between annotation guidelines of human-
annotated corpora. The source classifier and the transfer classifier are trained in the
same way as in the previous model. The transfer classifier is used to process the source
语料库, with the source annotations as guiding information, so as to relabel the source
corpus with the target annotation guideline. By merging the target corpus and the
transformed source corpus for the training of the final classifier, improved classification
accuracy can be achieved.
From this point on we describe the pipelines of annotation transformation in
pseudo-codes for simplicity and convenience of extensions. Algorithm 3 shows the
overall training algorithm for the variant model. Cs and Ct denote the source corpus
and the target corpus. Ms and Ms→t denote the source classifier and the transfer
classifier. C^q_p denotes the corpus p relabeled following annotation guideline q; for example,
C^t_s is a corpus that labels the text of the source corpus with the target guideline. Func-
tions TRAIN and TRANSTRAIN train the source classifier and the transfer classifier,
分别, both invoking the perceptron algorithm, but with different feature sets.
Functions ANNOTATE and TRANSANNOTATE call the function DECODE with different
型号 (source/transfer classifiers), feature functions (without/with guiding features),
and inputs (raw/source-annotated sentences).
In the algorithm the parameters corresponding to development sets are omitted for
simplicity. Compared to the online knowledge integration methodology of the previous
模型, annotation transformation leads to improved performance in an offline manner
by integrating corpora before the training procedure. This approach enables processing
speeds several times faster than the cascaded classifiers in the previous model. 它也是
has another advantage in that we can integrate the knowledge in more than two corpora
without slowing down the processing of the final classifier.
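The transformation pipeline of Algorithm 3 can be sketched in Python as follows, assuming the training and annotation routines exist with the interfaces shown (a sketch of ours, not the authors' implementation).

```python
def anno_trans(source_corpus, target_corpus, train, trans_train,
               annotate, trans_annotate):
    """Sketch of Algorithm 3: transform the source corpus into the target
    guideline and merge it with the target corpus."""
    m_s = train(source_corpus)                                 # source classifier
    # noisy parallel corpus: the target text with automatic source-style labels
    source_style_target = annotate(m_s, target_corpus)
    m_s2t = trans_train(source_style_target, target_corpus)    # transfer classifier
    # relabel the source corpus with the target guideline
    transformed_source = trans_annotate(m_s2t, source_corpus)
    return transformed_source + target_corpus                  # merged training set
```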
4.4 模型 3
The training of the transfer classifier is based on an automatically generated (相当
than a gold standard) parallel annotated corpus, where the source annotations are
Algorithm 3 Baseline annotation adaptation.
 1: function ANNOTRANS(Cs, Ct)
 2:   Ms ← TRAIN(Cs)                            ⊲ source classifier
 3:   C^s_t ← ANNOTATE(Ms, Ct)
 4:   Ms→t ← TRANSTRAIN(C^s_t, Ct)              ⊲ transfer classifier
 5:   C^t_s ← TRANSANNOTATE(Ms→t, Cs)
 6:   C^t_∗ ← C^t_s ∪ Ct                        ⊲ integrated corpus with target guideline
 7:   return C^t_∗
 8: end function
 9: function DECODE(M, Φ, x)
10:   return argmax_{y∈GEN(x)} S(y|M, Φ, x)
11: end function
provided by the source classifier. 所以, the performance of annotation transfor-
mation is correspondingly determined by the accuracy of the source classifier, 和我们
can generate a more accurate parallel annotated corpus for better annotation adaptation
if an improved source classifier can be obtained. Based on Model 2, two optimization
strategies—iterative bidirectional training and predict-self hypothesis—are introduced
to optimize the parallel annotated corpora for better annotation adaptation.
We first use an iterative training procedure to gradually improve the transformation
accuracy by iteratively optimizing the parallel annotated corpora. In each training
iteration, both source-to-target and target-to-source annotation transformations are per-
形成的, and the transformed corpora are used to provide better annotations for the
parallel annotated corpora of the next iteration. Then in the new iteration, the better
parallel annotated corpora will result in more accurate transfer classifiers, 导致
the generation of better transformed corpora.
Algorithm 4 shows the overall procedure of the iterative training method. 这
loop of lines 6–13 iteratively performs source-to-target and target-to-source annota-
tion transformations. The source annotations of the parallel annotated corpora, C^s_t
and C^t_s, are initialized by applying the source and target classifiers to the target and
source corpora, 分别 (lines 2–5). In each training iteration, the transfer classifiers
are trained on the current parallel annotated corpora (lines 7–8); they are used to
produce the transformed corpora (lines 9–10), which provide better annotations for
the parallel annotated corpora of the next iteration. The iterative training terminates
when the performance of the classifier trained on the merged corpus C^t_s ∪ Ct converges
(line 13).
The discriminative training of TRANSTRAIN predicts the target annotations with the
guidance of source annotations. In the first iteration, the transformed corpora generated
by the transfer classifiers are better than the initial ones generated by the source and tar-
get classifiers, due to the assistance of the guiding features. In the following iterations,
Algorithm 4 Iterative annotation transformation.
 1: function ITERANNOTRANS(Cs, Ct)
 2:   Ms ← TRAIN(Cs)                            ⊲ source classifier
 3:   C^s_t ← ANNOTATE(Ms, Ct)
 4:   Mt ← TRAIN(Ct)                            ⊲ target classifier
 5:   C^t_s ← ANNOTATE(Mt, Cs)
 6:   repeat
 7:     Ms→t ← TRANSTRAIN(C^s_t, Ct)            ⊲ source-to-target transfer classifier
 8:     Mt→s ← TRANSTRAIN(C^t_s, Cs)            ⊲ target-to-source transfer classifier
 9:     C^t_s ← TRANSANNOTATE(Ms→t, Cs)
10:     C^s_t ← TRANSANNOTATE(Mt→s, Ct)
11:     C^t_∗ ← C^t_s ∪ Ct
12:     M∗ ← TRAIN(C^t_∗)                       ⊲ enhanced classifier trained on merged corpus
13:   until EVAL(M∗) converges
14:   return C^t_∗
15: end function
16: function DECODE(M, Φ, x)
17:   return argmax_{y∈GEN(x)} S(y|M, Φ, x)
18: end function
the transformed corpora provide better annotations for the parallel annotated corpora
of the subsequent iteration; the transformation accuracy will improve gradually along
with the optimization of the parallel annotated corpora until convergence.
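A compact Python rendering of the iterative loop of Algorithm 4 is sketched below, under the same interface assumptions as the Model 2 sketch in Section 4.3; the convergence test on a held-out evaluation score is a simplification of ours.

```python
def iter_anno_trans(source_corpus, target_corpus, train, trans_train,
                    annotate, trans_annotate, evaluate, max_iter=10):
    """Sketch of Algorithm 4: iterative bidirectional annotation transformation."""
    para_t = annotate(train(source_corpus), target_corpus)  # target text, source-style labels
    para_s = annotate(train(target_corpus), source_corpus)  # source text, target-style labels
    best_score, merged = float("-inf"), None
    for _ in range(max_iter):
        m_s2t = trans_train(para_t, target_corpus)   # source-to-target transfer classifier
        m_t2s = trans_train(para_s, source_corpus)   # target-to-source transfer classifier
        para_s = trans_annotate(m_s2t, source_corpus)
        para_t = trans_annotate(m_t2s, target_corpus)
        merged = para_s + target_corpus              # merged corpus in the target guideline
        score = evaluate(train(merged))              # accuracy on a development set
        if score <= best_score:                      # stop once no longer improving
            break
        best_score = score
    return merged
```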
The predict-self hypothesis is introduced to improve the transformation accuracy
from another perspective. This hypothesis is implicit in many unsupervised learning
方法, such as Markov random field; it has also been successfully used by Daum´e
三、 (2009) in unsupervised dependency parsing. The basic idea of predict-self is, if a
prediction is a better candidate for an input, it would be easier to convert it back to the
original input by a reverse procedure. If applied to annotation transformation, predict-
self indicates that a better transformation candidate following the target guideline can
be more easily transformed back to the original form following the source guideline.
The most intuitive strategy to introduce the predict-self methodology into annota-
tion transformation is using a reversed annotation transformation procedure to filter
out unreliable predictions of the previous transformation. In detail, a source-to-target
annotation transformation procedure is performed on the source corpus to obtain a pre-
diction that follows the target guideline; then a second, target-to-source transformation
procedure is performed on this prediction result to check whether it can be transformed
back to the original source annotation. The source-to-target prediction results that fail
in this reverse-verification step are discarded, so this strategy can be called predict-self
filtering.
A more sophisticated strategy can be called predict-self re-estimation. Instead of
using the reversed transformation procedure for filtering, the re-estimation strategy
integrates the scores given by the source-to-target and target-to-source annotation
transformation models when evaluating the transformation candidates. By properly
tuning the relative weights of the two transformation directions, better transformation
performance is achieved. The scores of the two transformation models are weighted,
integrated in a log-linear manner:
S+(y | Ms→t, Mt→s, Φ, x) = (1 − λ) × S(y | Ms→t, Φ, x) + λ × S(x | Mt→s, Φ, y)    (3)
The weight parameter λ is tuned on the development set. To integrate the predict-
self reestimation into the iterative transformation training, a reversed transformation
model is introduced and the enhanced scoring function is used when the function
TRANSANNOTATE invokes the function DECODE.
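In code, the re-estimated score of Equation (3) is a simple interpolation of the two transformation models' scores; the score-function interfaces in this sketch are assumptions of ours.

```python
def reestimated_score(y, x, score_s2t, score_t2s, lam):
    """Predict-self re-estimation (Equation (3)): interpolate the score of
    candidate y given x under the source-to-target model with the score of
    transforming y back into the original source annotation x."""
    # score_s2t(y, x): S(y | Ms->t, Phi, x); score_t2s(x, y): S(x | Mt->s, Phi, y)
    return (1.0 - lam) * score_s2t(y, x) + lam * score_t2s(x, y)
```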
5. 实验
To evaluate the performance of annotation adaptation, we experiment on two impor-
tant NLP tasks, Chinese word segmentation and dependency parsing, both of which
can be modeled as discriminative classification problems. For both tasks, we give the
performances of the baseline models and the annotation adaptation algorithms.
5.1 Experimental Set-ups
We perform annotation adaptation for word segmentation from People’s Daily (PD)
(Yu et al. 2001) to Penn Chinese Treebank 5.0 (CTB) (薛等. 2005). The two corpora
are built according to different segmentation guidelines and differ largely in quantity
of data. CTB is smaller in size with about 0.5M words, whereas PD is much larger,
containing nearly 6M words. 桌子 4 shows the data partitioning for the two corpora.
桌子 4
Data partitioning for CTB and PD.
分割
Sections
# of Words
CTB
Training
Developing
Test
PD
Training
Test
1 - 270
400 - 931
1, 001 - 1, 151
301 - 325
271 - 300
02 - 06
01
0.47中号
6.66K
7.82K
5.86中号
1.07中号
To approximate more general scenarios of annotation adaptation problems, 我们
extract from PD a subset that is comparable to CTB in size. Because there are many
extremely long sentences in the original PD corpus, we first split them into normal sen-
tences according to the full-stop punctuation symbol. We randomly select 20,000 sen-
tences (0.45M words) from the PD training data as the new training set, and 1,000/1,000
sentences from the PD test data as the new testing/developing set. We label the smaller
version of PD as SPD. The balanced source corpus and target corpus also facilitate the
investigation of annotation adaptation.
Annotation adaptation for dependency parsing is performed from the CTB-derived
syntactic dependency treebank (DCTB) (Yamada and Matsumoto 2003) to the Seman-
tic Dependency Treebank (SDT) (Che et al. 2012). Semantic dependency encodes the
semantic relationships between words, which are very different from syntactic depen-
dencies. SDT is annotated on a small portion of the CTB text as depicted in Table 5;
所以, we use the subset of DCTB covering the remaining CTB text as the source
语料库. We still denote the source corpus as DCTB in the following for simplicity.
5.2 Baseline Segmenters and Parsers
We train the baseline perceptron classifiers for Chinese word segmentation on the train-
ing sets of CTB and SPD, using corresponding development sets to determine the best
桌子 5
Data partitioning for SDT.
分割
Sections
# of Words
Training
Developing
Test
1 - 10
36 - 65
81 - 121
1, 001 - 1, 078
1, 100 - 1, 119
1, 126 - 1, 140
66 - 80
1, 120 - 1, 125
11 - 35
1, 141 - 1, 151
244.44K
14.97K
33.51K
数字 6
Learning curve of the averaged perceptron classifier on the CTB developing set.
training iterations. The performance measurement indicator for word segmentation is
the balanced F-measure, F = 2PR/(磷 + 右), a function of Precision P and Recall R, 在哪里
P is the percentage of words in segmentation results that are segmented correctly, 和
R is the percentage of correctly segmented words in the gold standard words.
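Concretely, precision, recall, and F1 for segmentation can be computed by matching word spans, as in this short sketch of ours.

```python
def segmentation_f1(gold_words, pred_words):
    """Balanced F-measure for word segmentation, computed over word spans."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    g, p = spans(gold_words), spans(pred_words)
    correct = len(g & p)
    P = correct / len(p) if p else 0.0
    R = correct / len(g) if g else 0.0
    return 2 * P * R / (P + R) if P + R else 0.0
```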
For both syntactic and semantic dependency parsing, we concentrate on unlabeled
parsing, which predicts the dependency structure of the input sentence without
considering dependency labels. The perceptron-based baseline dependency models are
trained on the training sets of DCTB and SDT, using the development sets to determine
the best training iterations. The performance measurement indicator for dependency
parsing is the Unlabeled Attachment Score, denoted as Precision P, indicating the
percentage of words in predicted dependency structure that are correctly attached to
their head words.
数字 6 shows the learning curve of the averaged perceptron for word segmenta-
tion on the development set. Accuracies of the baseline classifiers are listed in Table 6.
We also report the performance of the classifiers on the testing sets of the opposite
语料库. Experimental results are in line with our expectations. A classifier performs
better on its corresponding testing set, and performs significantly worse on testing data
following a different annotation guideline.
桌子 7 shows the accuracies of the baseline syntactic and semantic parsers, 还有
as the performance of the parsers on the testing sets of the opposite corpora. 相似的
桌子 6
Performance of the baseline word segmenters.
                     Test on (F1%)
Train on       CTB                 SPD
CTB            97.35               86.65 (↓ 10.70)
SPD            91.23 (↓ 3.02)      94.25
桌子 7
Performance of the baseline dependency parsers.
                     Test on (P%)
Train on       SDT                 DCTB
SDT            77.51               53.62 (↓ 23.89)
DCTB           51.92 (↓ 32.62)     84.54
桌子 8
Performance of automatic annotation adaptation models for word segmentation.
模型
时间 (秒) Accuracy (F1%)
Merging
模型 1
模型 2
模型 3
Baseline
1.33
4.39
1.33
1.33
1.21
93.79
97.67
97.69
97.97
97.35
to the situations in word segmentation, two parsers give state-of-the-art accuracies on
their own testing sets, but perform poorly on the other testing sets. This indicates the
degree of divergence between the annotation guidelines of DCTB and SDT.
5.3 Automatic Annotation Adaptation
As a variant of Model 1, 模型 2 shares the same transfer classifier, and differs only in
training and decoding of the final classifier. Tables 8 和 9 show the performances of
systems resulting from Models 1 和 2, as well as the classifiers trained on the directly
merged corpora. The time costs for decoding are also listed to facilitate the practical
比较.
We find that the simple corpus merging strategy leads to a dramatic decrease in
准确性, due to the different and incompatible annotation guidelines. 模型 1, 这
simplest model for annotation adaptation, gives significant improvement over the
baseline classifiers for word segmentation and dependency parsing. This indicates that
the statistical regularities for annotation adaptation learned by the transfer classifiers
桌子 9
Performance of automatic annotation adaptation models for dependency parsing.
模型
时间 (min) Accuracy (P%)
Merging
模型 1
模型 2
模型 3
Baseline
7.38
15.43
7.38
7.38
6.67
65.95
78.69
78.73
79.34
77.51
数字 7
Training curve of the iterative training strategy over Model 2 for word segmentation.
bring performance improvement, utilizing guided decisions in the cascaded classifiers.
模型 2 leads to classifiers with accuracy increments comparable to those of Model 1,
while consuming only one third of the decoding time. It is inconsistent with our expec-
站. The strategy of directly transforming the source corpus to the target guideline
also facilitates the utilization of more than one source corpus.
We first introduce the iterative training strategy to Model 2. The corresponding
development sets are used to determine the best training iterations for the iterative
annotation transformations. After each iteration, we test the performance of the clas-
sifiers trained on the merged corpora. 人物 7 和 8 show the performance curves
for Chinese word segmentation and semantic dependency parsing, 分别, 和
iterations ranging from 1 到 10. The performance of Model 2 is naturally included
[Figure 8 plot: accuracy (P%) on the y-axis (78–80) against training iterations (0–10) on the x-axis; curves for “+ iterative training” and “Model 2 (baseline)”.]
数字 8
Training curve of the iterative training strategy over Model 2 for dependency parsing.
数字 9
Performance of predict-self filtering and predict-self re-estimation over Model 2 for word
segmentation.
in the curves (located at iteration 1). The curves show that, for both segmentation
and parsing, the accuracies of the classifiers trained on the merged corpora consis-
tently improve in the earlier iterations (例如, from iteration 2 to iteration 5 for word
segmentation).
Experiments for introduction of predict-self filtering and predict-self re-estimation
are shown in Figures 9 和 10. The curves show the performances of the predict-
self re-estimation with a series of weight parameters, 范围从 0 到 1 with step
0.05. Note that in both figures, the points at λ = 0 show the performances of Model 2.
We find that predict-self filtering brings a slight improvement over the baseline for
[Figure 10 plot: accuracy (P%) on the y-axis (78.5–80) against the predict-self ratio (0–1) on the x-axis; curves for “Model 2 + predict-self reestimation”, “Model 2 + predict-self filtering”, and “Model 2 (baseline)”.]
数字 10
Performance of predict-self filtering and predict-self re-estimation over Model 2 for dependency
解析.
数字 11
Training curve of annotation adaptation with Model 3 for word segmentation.
[Figure 12 plot: accuracy (P%) on the y-axis (78–80.5) against training iterations (0–10) on the x-axis; curves for “Model 3”, “+ iterative training”, and “Model 2 (baseline)”.]
数字 12
Training curve of annotation adaptation with Model 3 for dependency parsing.
word segmentation, but even decreases the accuracy for dependency parsing. An initial
analysis on the experimental results reveals that the filtering strategy discards some
complicated sentences in the source corpora, and the discarded sentences would bring
further improvement if properly used. 例如, in word segmentation, predict-self
filtering discards 5% of sentences from the source corpus, containing nearly 10% 的
training words. For the two tasks, the predict-self re-estimation outperforms the filtering
战略. With properly tuned weights, predict-self re-estimation can make better use of
the training data. The largest accuracy improvements achieved over Model 2 for word
segmentation and dependency parsing are 0.3 points and 0.6 points, respectively.
人物 11 和 12 show the performance curves after the introduction of both
iterative training and predict-self re-estimation on the basis of Model 2 (this enhanced
桌子 10
The performance of the enhanced annotation adaptation for word segmentation, compared with
previous work.
模型
SPD → CTB
PD → CTB
Previous Work
(Jiang et al. 2008)
(Kruengkrai et al. 2009)
(Zhang and Clark 2010)
(Sun 2011)
Accuracy (F1%)
97.97
98.43
97.85
97.87
97.79
98.17
桌子 11
The performance of the enhanced annotation adaptation for dependency parsing, compared
with the results of SemEval-2012 contest, where the best systems of each institute are listed.
模型
DCTB → SDT
SemEval-2012 Contest
Zhijun Wu-1
Zhou Qiaoli-3
NJU-Parser-1
ICT-1
Giuseppe Attardi-SVM-1-R
Accuracy (P%)
79.34
80.64
80.60
80.35
73.20
60.83
model is denoted as Model 3). We find that the predict-self re-estimation brings im-
provement to the iterative training at each iteration, for both word segmentation and
dependency parsing. The maximum performance is achieved at iteration 4 for word
segmentation, and at iteration 5 for dependency parsing. The corresponding models
are evaluated on the corresponding testing sets, and the experimental results are also
shown in Tables 8 和 9. Compared to Model 1, the optimized annotation adaptation
战略, 模型 3, leads to classifiers with significantly higher accuracy and to process-
ing speeds that are several times faster. Tables 10 和 11 show the experimental results
compared with previous work. For both Chinese word segmentation and semantic de-
pendency parsing, automatic annotation adaptation yields state-of-the-art performance,
despite using single classifiers with only local features. Note that for the systems in the
SemEval contest (Che et al. 2012), many other technologies including clause segmen-
站, system combination, and complicated features were adopted, as well as elab-
orate engineering. We also performed significance tests2 to verify the effectiveness of
annotation adaptation. We find that for both Chinese word segmentation and semantic
dependency parsing, annotation adaptation brings significant improvement (p < 0.001)
over the baselines trained on the target corpora only.
2 http://www.cis.upenn.edu/∼dbikel/download/compare.pl.
Table 12
Quantitative analysis for performance of annotation adaptation on word segmentation. The
words are grouped according to POS tags, and for each group, the recall values of baseline
and annotation adaptation are reported. To save space, only the categories with proportions
of more than 1% and with performance fluctuations of more than 0.1 points are listed.
Word Type   Proportion   Baseline   Anno. Ada.   Trend
AD          3.99         98.75      96.87        ↓
CD          2.07         98.79      99.39        ↑
JJ          3.25         95.01      95.40        ↑
LC          1.36         98.16      99.08        ↑
NN          29.95        97.79      98.83        ↑
NR          9.81         97.32      98.98        ↑
VA          1.58         95.27      97.63        ↑
VV          13.52        98.33      98.61        ↑
To evaluate the stability of annotation adaptation, we perform quantitative analysis
on the results of annotation adaptation. For word segmentation, the words are grouped
according to POS tags. For dependency parsing, the dependency edges are grouped
according to POS tag pairs. For each category, the recall values of baseline and annota-
tion adaptation are reported. To filter the lists, we set two significance thresholds with
respect to the proportion of a category and the performance fluctuation between two
systems. For word segmentation, only the categories with proportions of more than
1% and with fluctuations of more than 0.1 points are reserved, and for dependency
parsing, the two thresholds are 1% and 0.5. Tables 12 and 13 show the analysis results for
word segmentation and dependency parsing, respectively. For both tasks, annotation
adaptation brings improvement for most of the situations.
Table 13
Quantitative analysis for performance of annotation adaptation on dependency parsing.
The dependency edges are grouped according to POS tag pairs, and for each group, the recall
values of baseline and annotation adaptation are reported. To save space, only the categories
with proportions of more than 1% and with performance fluctuations of more than 0.5 points
are listed.
Edge Type   Proportion   Baseline   Anno. Ada.   Trend
NN→JJ       2.61         90.76      93.41        ↑
NN→LC       1.00         98.19      97.29        ↓
NN→M        1.68         83.48      87.43        ↑
NN→NR       3.14         79.82      83.86        ↑
NN→P        2.31         85.49      84.83        ↓
NN→PU       3.15         75.71      76.39        ↑
NN→VV       1.89         64.64      71.81        ↑
VC→PU       1.33         67.87      70.13        ↑
VE→NN       1.07         77.80      82.86        ↑
VV→AD       5.55         81.69      83.00        ↑
VV→DEC      1.76         93.82      94.51        ↑
VV→NN       11.36        78.85      82.02        ↑
VV→NR       1.71         76.19      79.01        ↑
VV→PU       9.89         64.40      66.91        ↑
VV→VV       7.72         59.74      62.21        ↑
[Figure 13 plot: accuracy (F%) on the y-axis (90–98) against the scale of the target corpus (# of sentences, 1000–16000) on the x-axis; curves for “Target corpus only” and “Annotation Adaptation”.]
Figure 13
Performance of annotation adaptation with varying-size target corpora for Chinese word
segmentation.
We further investigate the effect of varying the sizes of the target corpora. Exper-
iments are conducted for word segmentation and dependency parsing with fixed-size
source corpora and varying-size target corpora. We use SPD and DCTB as the source
corpora for word segmentation and dependency parsing, respectively. Figures 13 and
14 show the performance curves on the testing sets. We find that, for both word segmen-
tation and dependency parsing, the improvements brought by annotation adaptation
are more significant when the target corpora are smaller. It means that the automatic
annotation adaptation is more valuable when the size of the target corpus is small,
[Figure 14 plot: accuracy (P%) on the y-axis (70–80) against the scale of the target corpus (# of sentences, 500–8000) on the x-axis; curves for “Target corpus only” and “Annotation Adaptation”.]
Figure 14
Performance of annotation adaptation with varying-size target corpora for semantic dependency
parsing.
Of course, comparing automatic annotation adaptation with previous strategies that use no
additional training data is not entirely fair. Our work aims at another route to improving
NLP tasks: collecting and exploiting more training data rather than squeezing the most out
of a single corpus. We believe that the performance of automatic annotation adaptation can
be further improved by adopting techniques from previous work, such as richer features and
model combination. It would be useful to conduct experiments with more source-annotated
training data, such as the SIGHAN data set for word segmentation, to investigate how the
improvement evolves as the number of annotated sentences grows. It would also be valuable
to evaluate the improved word segmenter and dependency parser on out-of-domain data sets.
However, most current corpora for word segmentation and dependency parsing do not
explicitly distinguish the domains of their data sections, making such evaluations
difficult to conduct.
6. Discussion: Application Situations
Automatic annotation adaptation aims to transform the annotations in a corpus into
annotations that follow another guideline. The models for annotation adaptation use a
transfer classifier to learn the statistical correspondence regularities between different
annotation guidelines. These regularities are learned from a parallel annotated corpus,
which does not need to be manually annotated: the transfer classifier is trained on an
automatically generated parallel annotated corpus, obtained by processing one corpus with
a classifier trained on another. That is to say, if we want to conduct annotation
adaptation across several corpora, no additional corpora need to be manually annotated.
This setting makes the strategy more general, because manually building a parallel
annotated corpus is much harder, regardless of the language or the NLP problem under
consideration. To tackle the noise in automatically generated annotations, the advanced
models we designed generate a better parallel annotated corpus through strategies such as
iterative optimization.
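To make the procedure concrete, the following is a minimal sketch of the two steps
described above. It is written in Python for illustration only; the function names and the
callables passed in (source_model_predict and train) are hypothetical placeholders for
whatever base classifiers a given task uses, not the authors' implementation.

from typing import Callable, Iterable, List, Tuple

Sentence = object      # e.g., a raw Chinese sentence
Annotation = object    # e.g., a segmentation or a dependency tree

def build_parallel_corpus(
    target_corpus: Iterable[Tuple[Sentence, Annotation]],
    source_model_predict: Callable[[Sentence], Annotation],
) -> List[Tuple[Sentence, Annotation, Annotation]]:
    """Automatically generate the parallel annotated corpus: each target
    sentence keeps its gold target-guideline annotation and additionally
    receives a (noisy) source-guideline annotation predicted by a
    classifier trained on the source corpus."""
    return [
        (sentence, source_model_predict(sentence), gold_target)
        for sentence, gold_target in target_corpus
    ]

def train_transfer_classifier(
    parallel_corpus: List[Tuple[Sentence, Annotation, Annotation]],
    train: Callable[[List[Tuple[Tuple[Sentence, Annotation], Annotation]]], object],
) -> object:
    """Train the transfer classifier: it maps a sentence together with its
    source-guideline annotation to the target-guideline annotation,
    thereby learning the correspondence regularities between guidelines."""
    examples = [
        ((sentence, source_anno), target_anno)
        for sentence, source_anno, target_anno in parallel_corpus
    ]
    return train(examples)

At test time the same chain would be applied: the source classifier first annotates the
input, and the transfer classifier then produces the final target-guideline annotation
guided by that output.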
Automatic annotation adaptation can be applied in any situation where multiple corpora
with different and incompatible annotation philosophies exist for the same task. As our
case studies show, both Chinese word segmentation and dependency parsing have more than
one corpus with different annotation guidelines, for example, the People's Daily corpus
and the Penn Chinese Treebank for Chinese word segmentation. In a more abstract view,
constituency grammar and dependency grammar can be treated as two annotation guidelines
for parsing. The syntactic knowledge in a constituency treebank and a dependency treebank
can therefore be integrated by automatic annotation adaptation. For example, the LinGo
Redwoods Treebank could also be transformed to the annotation guideline of the Semantic
Dependency Treebank.
Furthermore, the annotations (such as a grammar) produced by bilingual projection or
unsupervised induction can be seen as following a special annotation philosophy. For
bilingually projected annotations, the annotation guideline resembles that of the
counterpart language; for annotations obtained by unsupervised induction, the guideline
reflects the statistical structural distribution of a specific data set. In both
situations, the underlying annotation guideline may differ greatly from those of the
testing sets, which usually come from human-annotated corpora. A system trained on a
projected or induced corpus may therefore perform poorly on an existing
testing set, but if the projected or induced corpus has high internal consistency, it can
still improve a system trained on an existing corpus through automatic annotation
adaptation. From this point of view, the practical value of current work on bilingual
projection and unsupervised induction may be underestimated, and annotation adaptation
could make better use of the projected or induced knowledge.3
7. Related Work
There has already been some preliminary work tackling the divergence between different
annotation guidelines. Gao et al. (2004) described a transformation-based converter that
transfers word segmentation results from one annotation guideline to another. They
designed class-type transformation templates and used the transformation-based
error-driven learning method of Brill (1995) to learn which word delimiters should be
modified. Much effort has also been devoted to manual treebank transformation, where the
PTB is adapted to other grammar formalisms such as CCG and LFG (Cahill et al. 2002;
Hockenmaier and Steedman 2007). However, all of these approaches are heuristic-based:
they need manually designed transformation templates and involve heavy human engineering.
Such strategies are hard to generalize to POS tagging, not to mention more complicated
structured prediction tasks.
We previously investigated the automatic integration of word segmentation knowledge in
differently annotated corpora (Jiang, Huang, and Liu 2009; Jiang et al. 2012), which can
be seen as preliminary work on automatic annotation adaptation. Motivated by this initial
investigation, researchers have applied similar methodologies to constituency parsing
(Sun, Wang, and Zhang 2010; Zhu, Zhu, and Hu 2011) and word segmentation (Sun and Wan
2012). This previous work verified the effectiveness of automatic annotation adaptation,
but it revealed neither the essential definition of the problem nor the intrinsic
principles of its solutions. The present work, in contrast, clearly defines the problem of
annotation adaptation, reveals the intrinsic principles of the solutions, and
systematically describes a series of gradually improved models. The most advanced model
learns transformation regularities much better and achieves significantly higher accuracy
for both word segmentation and dependency parsing, without slowing down the final
language processors.
The problem of automatic annotation adaptation can be seen as a special case of transfer
learning (Pan and Yang 2010), in which the source and target tasks are similar but not
identical. More specifically, annotation adaptation assumes that the labeling mechanism is
the same across the source and target tasks but that the predictive functions differ. The
goal of annotation adaptation is to adapt the source predictive function to the target
task by exploiting the labeled data of both tasks. Furthermore, automatic annotation
adaptation falls approximately into the spectrum of relational-knowledge-transfer problems
(Mihalkova, Huynh, and Mooney 2007; Mihalkova and Mooney 2008; Davis and Domingos 2009),
but it tackles problems where the relational structure of the data in the source and
target domains can differ substantially, in other words, where the annotation schemes are
different and incompatible. This work enriches research on transfer learning by proposing
and solving an NLP problem different from previously studied settings. For more details on
transfer learning, please refer to the survey by Pan and Yang (2010).
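As a hedged formalization of this setting (the notation here is ours, introduced only for
illustration, and does not appear in the article):

\begin{align*}
  D_S &= \{(x_i, y_i^{S})\}_{i=1}^{N_S}, \quad y_i^{S} \in \mathcal{Y}_S && \text{(source-guideline corpus)} \\
  D_T &= \{(x_j, y_j^{T})\}_{j=1}^{N_T}, \quad y_j^{T} \in \mathcal{Y}_T && \text{(target-guideline corpus)} \\
  f_S &: \mathcal{X} \rightarrow \mathcal{Y}_S && \text{(source predictive function, trained on } D_S\text{)} \\
  g   &: \mathcal{X} \times \mathcal{Y}_S \rightarrow \mathcal{Y}_T && \text{(transfer classifier, trained on pairs } (x_j, f_S(x_j)) \text{ with labels } y_j^{T}\text{)}
\end{align*}

Under this notation, the two tasks share the input space and the general labeling
mechanism, while \mathcal{Y}_S and \mathcal{Y}_T encode the two incompatible guidelines;
annotation adaptation learns g so that g(x, f_S(x)) approximates the target-guideline
annotation of x.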
3 We have performed preliminary experiments on word segmentation. Bilingual projection was
conducted from English to Chinese with the Chinese–English FBIS corpus as the bilingual
corpus. With annotation adaptation, the projected word segmentation corpus brings a
significant F-measure improvement of nearly 0.6 points over the baseline trained on CTB only.
The training procedure for an annotation adaptation model requires a parallel annotated
corpus (which may be automatically generated); this places the method in the neighborhood
of the family of approaches known as annotation projection (Hwa et al. 2002, 2005;
Ganchev, Gillenwater, and Taskar 2009; Smith and Eisner 2009; Jiang and Liu 2010; Das and
Petrov 2011). Essentially, however, annotation adaptation and annotation projection tackle
different problems: the former aims to transform annotations from one guideline to another
(within the same language), whereas the latter aims to project annotations (together with
the annotation guideline) from one language to another. Accordingly, machine learning
methods for annotation adaptation focus on the automatic transformation of annotations,
while those for annotation projection focus on bilingual projection across languages.
Co-training (Sarkar 2001) and classifier combination (Nivre and McDonald 2008) are two
techniques for training improved dependency parsers. Co-training lets two different
parsing models learn from each other while parsing unlabeled text: one model selects
unlabeled sentences it can parse confidently and provides them to the other model as
additional training data, yielding more powerful parsers. Classifier combination lets
graph-based and transition-based dependency parsers utilize features extracted from each
other's parsing results to obtain combined, enhanced parsers. Both techniques let two
models learn from each other on the same corpus, with the same distribution and annotation
guideline, whereas our strategy integrates the knowledge in multiple corpora with
different annotation guidelines.
The iterative training procedure used in the optimized model shares some similarities with
the co-training algorithm for parsing (Sarkar 2001), where the training procedure lets two
different models learn from each other while parsing raw text. The key idea of co-training
is to exploit the complementarity of different parsing models to mine additional training
data from raw text, whereas iterative training for annotation adaptation emphasizes the
iterative optimization of the parallel annotated corpora used to train the transfer
classifiers. The predict-self methodology is implicit in many unsupervised learning
approaches and has been used successfully in unsupervised dependency parsing
(Daumé III 2009). We adapt this idea to the scenario of annotation adaptation to improve
transformation accuracy.
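The loop below is a schematic sketch of this iterative optimization, again in Python and
with hypothetical callables: refresh_parallel stands for whatever routine uses the current
transfer classifier to produce a cleaner parallel annotated corpus for the next round (for
example, a predict-self consistency check); it is not the authors' exact algorithm.

from typing import Callable, List, Tuple

ParallelExample = Tuple[object, object, object]   # (sentence, source-style anno, target-style anno)

def iterative_adaptation(
    initial_parallel: List[ParallelExample],
    train_transfer: Callable[[List[ParallelExample]], object],
    refresh_parallel: Callable[[object, List[ParallelExample]], List[ParallelExample]],
    rounds: int = 3,
) -> object:
    """Each round retrains the transfer classifier on the current parallel
    corpus and then uses the resulting model to improve that corpus,
    e.g., by re-annotating or down-weighting sentences whose transformed
    annotations look inconsistent."""
    parallel = initial_parallel
    transfer_model = None
    for _ in range(rounds):
        transfer_model = train_transfer(parallel)              # retrain on the current corpus
        parallel = refresh_parallel(transfer_model, parallel)  # optimize the corpus for the next round
    return transfer_model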
In recent years, much effort has been devoted to improving word segmentation and
dependency parsing, for example, the introduction of global training and complicated
features (Zhang and Clark 2007, 2010); the investigation of word structure (Li 2011);
strategies for hybrid, joint, or stacked modeling (Nakagawa and Uchimoto 2007; Kruengkrai
et al. 2009; Wang, Zong, and Su 2010; Sun 2011); and semi-supervised and unsupervised
techniques that exploit raw text (Zhao and Kit 2008; Johnson and Goldwater 2009;
Mochihashi, Yamada, and Ueda 2009; Hewlett and Cohen 2011). We believe that annotation
adaptation can be adopted jointly with complicated features, system combination, and
semi-supervised or unsupervised techniques to further improve the performance of word
segmentation and dependency parsing.
8. Conclusion and Future Work
We have described the problem of annotation adaptation and the intrinsic principles
of its solutions, and proposed a series of successively enhanced models that can au-
tomatically adapt the divergence between different annotation formats. These models
learn the statistical regularities of adaptation between different annotation guidelines,
and integrate the knowledge in corpora with different annotation guidelines. For Chinese
word segmentation and semantic dependency parsing, the annotation adaptation algorithms
bring significant improvements by integrating the knowledge in differently annotated
corpora: the People's Daily corpus and the Penn Chinese Treebank for word segmentation,
and the Penn Chinese Treebank and the Semantic Dependency Treebank for dependency parsing.
For both tasks, annotation adaptation leads to a segmenter and a parser that achieve
state-of-the-art performance, despite using only local features in single classifiers.
Many aspects of annotation adaptation deserve further investigation. First, annotation
adaptation models could be extended to other NLP tasks such as semantic analysis. Second,
jointly tackling divergences in both annotations and domains is an important problem. In
addition, a corpus obtained by unsupervised induction or bilingual projection, although
models trained on it perform poorly on the specified test data, may have high internal
annotation consistency. That is to say, such induced corpora can be treated as knowledge
sources following another annotation guideline, and the performance of current
unsupervised or bilingually projected models may be seriously underestimated. Annotation
adaptation may give us a new perspective on knowledge induction and on the measurement of
such methods.
Acknowledgments
Jiang, Lü, and Liu were supported by National Natural Science Foundation of China
(contract 61202216) and the National Key Technology R&D Program (no. 2012BAH39B03). Huang
was supported in part by the DARPA DEFT Project (FA8750-13-2-0041). Liu was partially
supported by the Science Foundation Ireland (grant no. 07/CE/I1142) as part of the CNGL at
Dublin City University. We also thank the anonymous reviewers for their insightful
comments. Finally, we want to thank Chris Hokamp for proofreading.
References
Blitzer, John, Ryan McDonald, and Fernando
Pereira. 2006. Domain adaptation with
structural correspondence learning. In
Proceedings of EMNLP, pages 120–128,
Sydney.
Brill, Eric. 1995. Transformation-based
error-driven learning and natural
language processing: A case study in
part-of-speech tagging. Computational
Linguistics, 21(4):543–565.
Buchholz, Sabine and Erwin Marsi. 2006.
CoNLL-X shared task on multilingual
dependency parsing. In Proceedings of
CoNLL, pages 149–164, New York, NY.
Cahill, Aoife, Mairead McCarthy, Josef van
Genabith, and Andy Way. 2002. Automatic
annotation of the Penn treebank with LFG
F-structure information. In Proceedings of
the LREC Workshop, Las Palmas.
Che, Wanxiang, Meishan Zhang, Yanqiu
Shao, and Ting Liu. 2012. Semeval-2012
task 5: Chinese semantic dependency
parsing. In Proceedings of SemEval,
pages 378–384, Montreal.
Collins, Michael. 2002. Discriminative
training methods for hidden Markov
models: Theory and experiments with
perceptron algorithms. In Proceedings of
EMNLP, pages 1–8, Philadelphia, PA.
Das, Dipanjan and Slav Petrov. 2011.
Unsupervised part-of-speech tagging with
bilingual graph-based projections. In
Proceedings of ACL, pages 600–609,
Portland, OR.
Daumé III, Hal. 2007. Frustratingly easy
domain adaptation. In Proceedings of ACL,
pages 256–263, Prague.
Daumé III, Hal. 2009. Unsupervised search-
based structured prediction. In Proceedings
of ICML, pages 209–216, Montreal.
Daumé III, Hal and Daniel Marcu. 2006.
Domain adaptation for statistical
classifiers. Journal of Artificial Intelligence
Research, 26:101–126.
Davis, Jesse and Pedro Domingos.
2009. Deep transfer via second-order
Markov logic. In Proceedings of ICML,
pages 217–224, Montreal.
Eisner, Jason M. 1996. Three new
probabilistic models for dependency
parsing: An exploration. In Proceedings of
COLING, pages 340–345, Copenhagen.
Ganchev, Kuzman, Jennifer Gillenwater, and
Ben Taskar. 2009. Dependency grammar
induction via bitext projection constraints.
In Proceedings of ACL, pages 369–377,
Singapore.
Gao, Jianfeng, Andi Wu, Mu Li, Chang-Ning
Huang, Hongqiao Li, Xinsong Xia, and
Haowei Qin. 2004. Adaptive Chinese word
segmentation. In Proceedings of ACL,
pages 462–469, Barcelona.
Hewlett, Daniel and Paul Cohen. 2011. Fully
unsupervised word segmentation with
BVE and MDL. In Proceedings of ACL,
pages 540–545, Portland, OR.
Hockenmaier, Julia and Mark Steedman.
2007. CCGBank: A corpus of CCG
derivations and dependency structures
extracted from the Penn treebank.
Computational Linguistics, 33(3):355–396.
Hwa, Rebecca, Philip Resnik, Amy
Weinberg, Clara Cabezas, and Okan Kolak.
2005. Bootstrapping parsers via syntactic
projection across parallel texts. Natural
Language Engineering, 11(3):311–325.
Hwa, Rebecca, Philip Resnik, Amy
Weinberg, and Okan Kolak. 2002.
Evaluating translational correspondence
using annotation projection. In Proceedings
of ACL, pages 392–399, Philadephia, PA.
Jiang, Wenbin, Liang Huang, and Qun Liu.
2009. Automatic adaptation of annotation
standards: Chinese word segmentation
and POS tagging – A case study. In
Proceedings of ACL, pages 522–530,
Singapore.
Jiang, Wenbin, Liang Huang, Yajuan Lü, and
Qun Liu. 2008. A cascaded linear model
for joint Chinese word segmentation and
part-of-speech tagging. In Proceedings of
ACL, pages 897–904, Columbus, OH.
Jiang, Wenbin and Qun Liu. 2010.
Dependency parsing and projection
based on word-pair classification. In
Proceedings of the ACL, pages 12–20,
Uppsala.
Jiang, Wenbin, Fandong Meng, Qun Liu,
and Yajuan Lü. 2012. Iterative annotation
transformation with predict-self
reestimation for Chinese word
segmentation. In Proceedings of EMNLP,
pages 412–420, Jeju Island.
Johnson, Mark and Sharon Goldwater. 2009.
Improving nonparameteric Bayesian
inference: Experiments on unsupervised
word segmentation with adaptor
grammars. In Proceedings of NAACL,
pages 317–325, Boulder, CO.
Kruengkrai, Canasai, Kiyotaka Uchimoto,
Junichi Kazama, Yiou Wang, Kentaro
Torisawa, and Hitoshi Isahara. 2009. An
error-driven word-character hybrid model
for joint Chinese word segmentation and
POS tagging. In Proceedings of
ACL-IJCNLP, pages 513–521, Singapore.
Li, Zhongguo. 2011. Parsing the internal
structure of words: A new paradigm for
Chinese word segmentation. In Proceedings
of ACL, pages 1,405–1,414, Portland, OR.
Marcus, Mitchell P., Beatrice Santorini, and
Mary Ann Marcinkiewicz. 1993. Building
a large annotated corpus of English: The
Penn treebank. Computational Linguistics,
19(2):313–330.
Martins, André F. T., Dipanjan Das, Noah A.
Smith, and Eric P. Xing. 2008. Stacking
dependency parsers. In Proceedings of
EMNLP, pages 157–166, Honolulu, HI.
McDonald, Ryan, Koby Crammer,
and Fernando Pereira. 2005. Online
large-margin training of dependency
parsers. In Proceedings of ACL,
pages 91–98, Ann Arbor, MI.
McDonald, Ryan and Fernando Pereira.
2006. Online learning of approximate
dependency parsing algorithms. In
Proceedings of EACL, pages 81–88, Trento.
Mihalkova, Lilyana, Tuyen Huynh, and
Raymond J. Mooney. 2007. Mapping
and revising Markov logic networks for
transfer learning. In Proceedings of AAAI,
volume 7, pages 608–614, Vancouver.
Mihalkova, Lilyana and Raymond J. Mooney.
2008. Transfer learning by mapping with
minimal target data. In Proceedings of
AAAI Workshop Transfer Learning for
Complex Tasks, Chicago, IL.
Mochihashi, Daichi, Takeshi Yamada,
and Naonori Ueda. 2009. Bayesian
unsupervised word segmentation
with nested Pitman-Yor language
modeling. In Proceedings of ACL-IJCNLP,
pages 100–108, Singapore.
Nakagawa, Tetsuji and Kiyotaka Uchimoto.
2007. A hybrid approach to word
segmentation and POS tagging. In
Proceedings of ACL, pages 217–220, Prague.
Ng, Hwee Tou and Jin Kiat Low. 2004.
Chinese part-of-speech tagging:
One-at-a-time or all-at-once? Word-based
or character-based? In Proceedings of
EMNLP, pages 277–284, Barcelona.
Nivre, Joakim and Ryan McDonald.
2008. Integrating graph-based and
transition-based dependency parsers.
In Proceedings of ACL, pages 950–958,
Columbus, OH.
Oepen, Stephan, Kristina Toutanova,
Stuart Shieber, Christopher Manning,
Dan Flickinger, and Thorsten Brants.
2002. The LinGo Redwoods treebank:
Motivation and preliminary applications.
In Proceedings of COLING, volume 2,
pages 1–5, Taipei.
Pan, Sinno Jialin and Qiang Yang. 2010.
A survey on transfer learning. IEEE
TKDE, 22(10):1345–1359.
Sarkar, Anoop. 2001. Applying
co-training methods to statistical
parsing. In Proceedings of NAACL,
pages 1–8, Pittsburgh, PA.
Smith, David and Jason Eisner. 2009.
Parser adaptation and projection with
quasi-synchronous grammar features.
In Proceedings of EMNLP, volume 2,
pages 822–831, Singapore.
Sun, Weiwei. 2011. A stacked sub-word
model for joint Chinese word segmentation
and part-of-speech tagging. In Proceedings
of ACL, pages 1,385–1,394, Portland, OR.
Sun, Weiwei and Xiaojun Wan. 2012.
Reducing approximation and estimation
errors for Chinese lexical processing with
heterogeneous annotations. In Proceedings
of ACL, volume 1, pages 232–241,
Jeju Island.
Sun, Weiwei, Rui Wang, and Yi Zhang. 2010.
Discriminative parse reranking for
Chinese with homogeneous and
heterogeneous annotations. In Proceedings
of CIPS-SIGHAN, Beijing. Available at
http://aclweb.org/anthology/W10-4144.
Wang, Kun, Chengqing Zong, and Keh-Yih
Su. 2010. A character-based joint model for
Chinese word segmentation. In Proceedings
of COLING, pages 1,173–1,181, Beijing.
Xue, Nianwen and Libin Shen. 2003. Chinese
word segmentation as LMR tagging. In
Proceedings of SIGHAN Workshop,
volume 17, pages 176–179, Sapporo.
Xue, Nianwen, Fei Xia, Fu-Dong Chiou, and
Martha Palmer. 2005. The Penn Chinese
treebank: Phrase structure annotation
of a large corpus. Natural Language
Engineering, 11(2):207–238.
Yamada, H. and Y. Matsumoto. 2003.
Statistical dependency analysis with
support vector machines. In Proceedings
of IWPT, pages 195–206, Nancy.
Yu, Shiwen, Jianming Lu, Xuefeng Zhu,
Huiming Duan, Shiyong Kang, Honglin
Sun, Hui Wang, Qiang Zhao, and
Weidong Zhan. 2001. Processing norms
of modern Chinese corpus. Technical
report, Institute of Computational
Linguistics, Peking University.
Zhang, Yue and Stephen Clark. 2007.
Chinese segmentation with a word-based
perceptron algorithm. In Proceedings of
ACL, pages 840–847, Prague.
Zhang, Yue and Stephen Clark. 2010. A fast
decoder for joint word segmentation and
POS-tagging using a single discriminative
model. In Proceedings of EMNLP,
pages 843–852, Cambridge, MA.
Zhao, Hai and Chunyu Kit. 2008.
Unsupervised segmentation helps
supervised learning of character tagging
for word segmentation and named entity
recognition. In Proceedings of IJCNLP,
pages 106–111, Hyderabad.
Zhu, Muhua, Jingbo Zhu, and Minghan Hu.
2011. Better automatic treebank conversion
using a feature-based approach.
In Proceedings of ACL, volume 2,
pages 715–719, Portland, OR.