Out-of-Domain Discourse Dependency Parsing via Bootstrapping:
An Empirical Analysis on Its Effectiveness and Limitation
Noriki Nishida and Yuji Matsumoto
RIKEN Center for Advanced Intelligence Project, Japan
{noriki.nishida, yuji.matsumoto}@riken.jp
Abstract
Discourse parsing has been studied for decades.
However, it still remains challenging to
utilize discourse parsing for real-world appli-
cations because the parsing accuracy degrades
significantly on out-of-domain text. In this paper,
we report and discuss the effectiveness
and limitations of bootstrapping methods for
adapting modern BERT-based discourse de-
pendency parsers to out-of-domain text with-
out relying on additional human supervision.
Specifically, we investigate self-training,
co-training, tri-training, and asymmetric
tri-training of graph-based and transition-based
discourse dependency parsing models, as well
as confidence measures and sample selection
criteria in two adaptation scenarios: mono-
logue adaptation between scientific disciplines
and dialogue genre adaptation. We also release
COVID-19 Discourse Dependency Treebank
(COVID19-DTB), a new manually annotated
resource for discourse dependency parsing of
biomedical paper abstracts. Experimental
results show that bootstrapping is significantly
and consistently effective for unsupervised
domain adaptation of discourse dependency
parsing, but the low coverage of accurately
predicted pseudo labels is a bottleneck for
further improvement. We show that active
learning can mitigate this limitation.
1 Introduction
Discourse parsing aims to uncover the structural
organization of text, which is useful in Natural
Language Processing (NLP) applications such as
document summarization (Louis et al., 2010;
Hirao et al., 2013; Yoshida et al., 2014; Bhatia
et al., 2015; Durrett et al., 2016; Xu et al.,
2020), text categorization (Ji and Smith, 2017;
Ferracane et al., 2017), question answering
(Verberne et al., 2007; Jansen et al., 2014), and
information extraction (Quirk and Poon, 2017).
In particular, dependency-style representation of
discourse structure has been studied intensively
in recent years (Asher and Lascarides, 2003;
Hirao et al., 2013; Li et al., 2014b; Morey et al.,
2018; Hu et al., 2019; Shi and Huang, 2019).
Figure 1 shows an example of a discourse dependency
structure, taken from the COVID-19
Discourse Dependency Treebank (COVID19-DTB),
a new manually annotated resource for discourse
dependency parsing of biomedical abstracts.
State-of-the-art discourse dependency parsers
are generally trained on a manually annotated
treebank, which is available in a limited number
of domains, such as RST-DT (Carlson et al., 2001)
for news articles, SciDTB (Yang and Li, 2018)
for NLP abstracts, and STAC (Asher et al., 2016)
and Molweni (Li et al., 2020) for multi-party
dialogues. However, when the parser is applied
directly to out-of-domain documents, the parsing
accuracy degrades significantly due to the domain
shift problem. In practice, we normally face this issue
in the real world because human supervision is
generally scarce and expensive to obtain in the
domain of interest.
Unsupervised Domain Adaptation (UDA) aims
to adapt a model trained on a source domain, where
a limited amount of labeled data is available, to
a target domain, where only unlabeled data is
available. Bootstrapping (or pseudo labeling) has
been shown to be effective for the UDA problem
of syntactic parsing (Steedman et al., 2003b,a;
Reichart and Rappoport, 2007; Søgaard and Rishøj,
2010; Weiss et al., 2015). In bootstrapping for
syntactic parsing, we first train a model on the
labeled source sentences, then use the model to
assign pseudo labels (i.e., parse trees) to unlabeled
target sentences, and finally retrain the model on
the manually and automatically labeled sentences.
In contrast, despite the significant progress
achieved in discourse parsing so far (Li et al.,
2014b; Ji and Eisenstein, 2014; Joty et al., 2015;
Perret et al., 2016; Wang et al., 2017; Kobayashi
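Transactions of the Association for Computational Linguistics, vol. 10, pp. 127–144, 2022. https://doi.org/10.1162/tacl_a_00451
Action Editor: Shay Cohen. Submission batch: 6/2021; Revision batch: 9/2021; Published 2/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.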
Figure 1: An example of discourse dependency structure
for a COVID-19 related biomedical paper abstract
(Israeli et al., 2020), which we manually annotated for
our new dataset.
et al., 2020; Koto et al., 2021), bootstrapping
for the UDA problem of discourse parsing is
still not well understood. Jiang et al. (2016) and
Kobayashi et al. (2021) explored how to enrich
the labeled dataset using bootstrapping methods;
however, their studies are limited to the in-domain
setup, where the labeled and unlabeled datasets
are derived from the same domain. In contrast
to these studies, we focus on the more realistic
and challenging scenario, namely, out-of-domain
discourse parsing, where the quality and diversity
of the pseudo-labeled dataset become more crucial
for performance enhancement.
In this paper, we perform a series of analyses of
various bootstrapping methods in UDA of modern
BERT-based discourse dependency parsers and
report the effectiveness and limitations of these
methods. Figure 2 shows an overview of our
bootstrapping system. Specifically, we investigate
self-training (Yarowsky, 1995), co-training (Blum
and Mitchell, 1998; Zhou and Goldman, 2004),
tri-training (Zhou and Li, 2005), and asymmet-
ric tri-training (Saito et al., 2017) of graph-based
and transition-based discourse dependency pars-
ing models, as well as confidence measures and
sample selection criteria in two adaptation sce-
narios: monologue adaptation between scientific
disciplines and dialogue genre adaptation. We
show that bootstrapping improves out-of-domain
discourse dependency parsing significantly and
consistently across different adaptation setups.
Figure 2: An overview of our bootstrapping system for
unsupervised domain adaptation of discourse depen-
dency parsing.
Our analyses also reveal that bootstrapping has
difficulty in creating pseudo-labeled data that
is both diverse and accurate, which is currently the
limiting factor in further improving accuracy, and
that it is difficult to boost the coverage by
simply increasing the number of unlabeled documents.
We show that an active learning approach
can be an effective solution to this limitation.1
The rest of this paper is organized as follows:
Section 2 provides an overview of related studies.
Section 3 clarifies the problem and describes the
methodology: bootstrapping algorithms, discourse
dependency parsing models, confidence measures,
and sample selection criteria. Section 4 describes
the details of COVID19-DTB. Section 5 describes
the experimental setup. In Section 6, we report
and discuss the experimental results and provide
practical recommendations for out-of-domain discourse
dependency parsing. Finally, Section 7
concludes the paper.
2 Related Work
Various discourse parsing models have been
proposed in the past decades. For constituency-
style discourse structure like RST (Mann and
Thompson, 1988), the parsing models can be categorized
into the chart-based approach (Joty et al.,
2013; Joty et al., 2015; Li et al., 2014a, 2016a),
which finds the globally optimal tree using an efficient
algorithm like dynamic programming, or
1 Our code is available at https://github.com/norikinishida/discourse-parsing.
The COVID19-DTB dataset is also available at
https://github.com/norikinishida/biomedical-discourse-treebanks.
transition-based (or sequential) approach (Marcu,
1999; Sagae, 2009; Hernault et al., 2010b; Feng
and Hirst, 2014; Ji and Eisenstein, 2014; Wang
et al., 2017; Kobayashi et al., 2020; Zhang et al.,
2020; Koto et al., 2021), which builds a tree
incrementally by performing a series of decisions.
For dependency-style discourse structure like the
RST variants (Hirao et al., 2013; Li et al., 2014b;
Morey et al., 2018) or Segmented Discourse Representation
Theory (Asher and Lascarides, 2003),
the models can also be categorized into the graph-based
approach (Li et al., 2014b; Yoshida et al.,
2014; Afantenos et al., 2015; Perret et al., 2016) or
the transition-based (sequential) approach (Muller
et al., 2012; Hu et al., 2019; Shi and Huang, 2019).
Recently, pre-trained transformer encoders such as
BERT (Devlin et al., 2019) and SpanBERT (Joshi
et al., 2019) have been shown to greatly improve
discourse parsing accuracy (Guz and Carenini,
2020; Koto et al., 2021). In this paper, we do not
aim to develop novel parsing models. Instead,
we aim to investigate the effectiveness and
limitations of bootstrapping methods for adapting
the modern BERT-based discourse parsers.
Manually annotated discourse treebanks are sig-
nificantly scarce, and their domains are limited.
For example, the most popular discourse treebank,
RST-DT (Carlson et al., 2001), contains
only 385 labeled documents in total. To address
the lack of large-scale labeled data, a number
of semi-supervised, weakly supervised, and un-
supervised techniques have been proposed in the
discourse parsing literature. Hernault et al. (2010a)
proposed a semi-supervised method that utilizes
unlabeled documents to expand feature vectors
in SVM classifiers in order to achieve better
generalization for infrequent discourse relations.
Liu and Lapata (2018) and Huber and Carenini
(2019) proposed to exploit document-level class
labels (e.g., sentiment) as distant supervision to induce
discourse dependency structures from neural
attention weights. Badene et al. (2019a,b) investigated
a data programming paradigm (Ratner et al.,
2016), which uses rule-based labeling functions to
automatically annotate unlabeled documents and
trains a generative model on the weakly supervised
data. Kobayashi et al. (2019) and Nishida and
Nakayama (2020) proposed fully unsupervised
discourse constituency parsers, which can produce
only tree skeletons and rely strongly on pre-trained
word embeddings or human prior knowledge on
document structure.
Technically most similar to our work, Jiang
et al. (2016) and Kobayashi et al. (2021) proposed
to enlarge the training dataset using a combination
of multiple parsing models. Jiang et al. (2016)
used co-training for enlarging the RST-DT train-
ing set with 2,000 Wall Street Journal articles,
with a focus on improving classification accu-
racy on infrequent discourse relations. Kobayashi
et al. (2021) proposed to exploit discourse subtrees
that are agreed upon by two different models for
enlarging the RST-DT training set. Interestingly,
their proposed methods improved the classification
accuracy especially for infrequent discourse
relations.
These studies mainly assume the in-domain scenario
and focus on enlarging the labeled set (e.g.,
RST-DT training set) using in-domain unlabeled
documents, and the system evaluation is generally
performed on the same domain as the original
labeled set (e.g., RST-DT test set). In this paper,
instead, we particularly focus on the UDA scenario,
where the goal is to parse the target-domain
documents accurately without relying on human
supervision in the target domain. We believe this
research direction is important for developing us-
able discourse parsers, because a target domain to
which one would like to apply a discourse parser is
normally different from the domains/genres of ex-
isting corpora, and manually annotated resources
are rarely available in most domains/genres.
3 Method
3.1 Problem Formulation
The input is a document represented as a se-
quence of clause-level (in single-authored text)
or utterance-level (in multi-party dialogues) spans
called Elementary Discourse Units (EDUs).2 Our
goal is to derive a discourse dependency structure,
$y = \{(h, d, r) \mid 0 \le h \le n,\ 1 \le d \le n,\ r \in R\}$,
given the input EDUs, $x = e_0, e_1, \ldots, e_n$. The structure
is analogous to a syntactic dependency structure.
A discourse dependency, $(h, d, r)$, represents that
the d-th EDU (called the dependent) relates to the h-th
EDU (called the head) directly with the discourse relation
$r \in R$. Each EDU except for the root node,
$e_0$, has a single head.
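As an illustration of the constraints in this formulation, the following minimal Python sketch checks that a candidate structure respects the index ranges, the relation inventory, and the single-head property. The function name `is_valid_structure` and the tuple encoding `(h, d, r)` are our own illustrative choices, not part of the released code, and the check deliberately does not test acyclicity.

```python
from typing import List, Set, Tuple

# One discourse dependency (h, d, r): head index, dependent index, relation.
Dependency = Tuple[int, int, str]


def is_valid_structure(deps: List[Dependency], n: int, relations: Set[str]) -> bool:
    """Check 0 <= h <= n, 1 <= d <= n, r in R, and one head per EDU e_1..e_n.

    Hypothetical helper for illustration; it does not verify that the
    dependencies form a tree (no cycle check).
    """
    head_of = {}
    for h, d, r in deps:
        if not (0 <= h <= n and 1 <= d <= n and r in relations):
            return False
        if d in head_of:  # a dependent may have only a single head
            return False
        head_of[d] = h
    # Every EDU except the root node e_0 must have received a head.
    return len(head_of) == n
```

For a three-EDU document (root plus e_1 and e_2), `[(0, 1, "Root"), (1, 2, "Elaboration")]` passes the check, while a structure that omits the head of e_2 does not.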
In this paper, we assume that we have a limited
number of labeled documents in the source
2 We refer to both single-authored text and multi-party
dialogues as documents.
domain, while a large collection of unlabeled
documents is available in the target domain. In
particular, we assume that the source and target
domains have different data distributions lexically
or rhetorically (e.g., vocabulary, document length,
and discourse relation distributions), but the domains
share the same annotation scheme (e.g.,
definition of discourse relation classes). Our task
is to adapt a parsing model (or models) trained in
the source domain to the target domain using the
unlabeled target data.
3.2 Bootstrapping
The aim of this paper is to investigate the effec-
tiveness and limitations of various bootstrapping
methods in UDA of modern BERT-based dis-
course dependency parsers. We show the overall
flow of the bootstrapping methods in Figure 2. Initially,
we have a small set of labeled documents,
$L^s$, in the source domain, and a large collection
of unlabeled documents, $U^t$, in the target
domain. Then the bootstrapping procedure works
as follows:
(1) Train initial models on $L^s = \{(x^s, y^s)\}$.
(2) Parse unlabeled documents $x^t \in U^t$ using
the current model $f$, i.e., $f : U^t \to L^t = \{(x^t, f(x^t))\}$.3,4
(3) Measure the confidence scores of the
pseudo-labeled data and select a subset,
$\tilde{L}^t \subset L^t$, that is expected to be reliable and
useful.
(4) Retrain the models on $L^s \cup \tilde{L}^t$ for several
epochs (set to 3 in this work).
Steps (2)-(4) loop for many rounds until a
predefined stopping criterion is met.
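The four steps above can be sketched as a generic loop. This is a minimal illustration under our own assumptions, not the released implementation: the callables `train`, `parse`, `confidence`, and `select` are hypothetical placeholders for the parser-specific components, and a fixed number of rounds stands in for the unspecified stopping criterion.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[Any, Any]  # (document x, dependency structure y)


def bootstrap(
    train: Callable[[Any, List[Example]], Any],
    parse: Callable[[Any, Any], Any],
    confidence: Callable[[Any, Any, Any], float],
    select: Callable[[List[Tuple[float, Any, Any]]], List[Example]],
    labeled_source: List[Example],
    unlabeled_target: Sequence[Any],
    rounds: int = 3,          # assumption: fixed round budget as stopping criterion
    sample_size: int = 5000,  # fresh sample of unlabeled documents per round
    seed: int = 0,
) -> Any:
    rng = random.Random(seed)
    # Step (1): train the initial model on the labeled source data.
    model = train(None, labeled_source)
    for _ in range(rounds):
        size = min(sample_size, len(unlabeled_target))
        batch = rng.sample(list(unlabeled_target), size)
        # Step (2): pseudo-label the sampled target documents.
        pseudo = [(x, parse(model, x)) for x in batch]
        # Step (3): score each pseudo-labeled document and keep a subset.
        scored = [(confidence(model, x, y), x, y) for x, y in pseudo]
        selected = select(scored)
        # Step (4): retrain on source data plus the selected pseudo labels.
        model = train(model, labeled_source + selected)
    return model
```

With a single model this corresponds to self-training; the multi-model variants differ only in which model's pseudo labels are passed to which model's retraining call.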
Bootstrapping can be interpreted as a methodol-
ogy where teachers generate pseudo supervision
for students, and the students learn the task on it.
Existing bootstrapping methods vary depending
3 For bootstrapping methods that employ multiple models
(e.g., co-training), $L^t$ is created for each model $f$.
4 In our experiments, for every bootstrapping round we
used 5K sampled documents instead of the whole set of unlabeled
documents $U^t$, because parsing all documents at every
bootstrapping round is computationally expensive and does
not scale to a large-scale dataset. The 5K samples were
refreshed for every bootstrapping round.
on how the teacher and student models are used.
In this paper, we specifically explore the following
bootstrapping methods: self-training (Yarowsky,
1995; McClosky et al., 2006; Reichart and
Rappoport, 2007; Suzuki and Isozaki, 2008; Huang
and Harper, 2009), co-training (Blum and Mitchell,
1998; Zhou and Goldman, 2004; Steedman et al.,
2003b,a), tri-training (Zhou and Li, 2005; Weiss
et al., 2015; Ruder and Plank, 2018), and asymmetric
tri-training (Saito et al., 2017).
Self-Training Self-Training (ST) starts with a
single model $f$ trained on $L^s$. The overall procedure
is the same as the one described above. The
single model is both a teacher and a student for
itself. Thus, it is difficult for the model to obtain
novel knowledge (or supervision) that the model
has not yet learned, and its errors may be amplified by
the retraining cycle.
Co-Training Co-Training (CT) starts with two
parsing models, $f_1$ and $f_2$, that are expected to
have different inductive biases from each other.
The two models are pre-trained on the same $L^s$.
In Step 2, each model independently parses the
unlabeled documents: $U^t \to L^t_i$ ($i = 1, 2$). In Step
3, each of the pseudo-labeled sets is filtered by
a selection criterion: $L^t_i \to \tilde{L}^t_i$. In Step 4, each
model $f_i$ is retrained on $L^s \cup \tilde{L}^t_j$ ($j \neq i$).
In CT, the two models teach each other. Thus,
each model is simultaneously the teacher and the student for the
other model. In contrast to ST,
each model can obtain knowledge that it has not
yet learned. CT can be viewed as enhancing the
agreement between the models.
Tri-Training Tri-Training (TT) consists
of three different models, $f_1$, $f_2$, and $f_3$, which
are initially trained on the same $L^s$. In contrast to
CT, where the single teacher $f_i$ is used to generate
pseudo labels $\tilde{L}^t_i$ for the student $f_j$ ($j \neq i$), TT
uses two teachers, $f_i$ and $f_j$ ($j \neq i$), to generate a
pseudo-labeled set $L^t_{i,j}$ for the remaining student
$f_k$ ($k \neq i, j$). We measure the confidence for
the pair of teachers' parse trees, $(y^t_i, y^t_j)$, using
the ratio of agreed dependencies (described in
Subsection 3.4), based on which we determine
whether or not to include the teachers' predictions
in the pseudo-labeled set.
Asymmetric Tri-Training Asymmetric
Tri-Training (AT) is an extension of TT for UDA.
A special domain-specific model $f^t_1$ is used only
for test inference; the other two models, $f_2$ and $f_3$,
are used only to generate pseudo labels $\tilde{L}^t$. The
domain-specific model $f^t_1$ is retrained on only $\tilde{L}^t$,
while $f_2$ and $f_3$ are retrained on $L^s \cup \tilde{L}^t$.
3.3 Parsing Models
We employ three types of BERT-based discourse
dependency parsers: (1) a graph-based
arc-factored model (McDonald et al., 2005)
with a biaffine attention mechanism (Dozat and
Manning, 2017), (2) a transition-based shift-reduce
model (Nivre, 2004; Chen and Manning,
2014; Kiperwasser and Goldberg, 2016), and (3)
the backward variant of the shift-reduce model.
EDU Embedding We compute EDU embeddings
using a pre-trained Transformer encoder.
This procedure is shared across the three parsing
models, though the Transformer parameters are
untied and fine-tuned separately. Specifically, we
first break down the input document into non-overlapping
segments of 512 subtokens, and then
encode each segment independently with the Transformer
encoder. Finally, we compute EDU-level
span embeddings as a concatenation of the Transformer
output states at the span endpoints ($w_i$ and
$w_j$) and the span-level syntactic head word5 $w_k$,
i.e., $[w_i; w_j; w_k]$.
Arc-Factored Model Arc-Factored Model (A)
is a graph-based dependency parser, which can
find the globally optimal dependency structure
using dynamic programming. Specifically, we
employ the biaffine attention model (Dozat and
Manning, 2017) for computing dependency scores
$s(h, d) \in \mathbb{R}$, and we decode the optimal structure
$y^*$ using the Eisner algorithm, such that the
tree score $\sum_{(h,d) \in y} s(h, d)$ is maximized. We
predict the discourse relation class for each
unlabeled dependency $(h, d) \in y^*$ using another
biaffine attention layer and MLP, namely,
$r^* = \arg\max_r P(r \mid h, d)$. To reduce the computational
time for inference, we employed the
Hierarchical Eisner Algorithm (Zhang et al.,
2021), which decodes dependency trees from the
sentence level to the paragraph level and then to
the whole text level.
5 A span-level syntactic head word is a token whose parent
in the syntactic dependency graph is ROOT or is not within
the EDU's span. When there are multiple head words in
an EDU, we choose the leftmost one. We used the spaCy
en_core_web_sm model to obtain the syntactic dependency
graph.
Shift-Reduce Model Shift-Reduce Model (S)
is a transition-based dependency parser, which
builds a dependency structure incrementally by
executing a series of local actions. Specifically, we
employ the arc-standard system proposed by Nivre
(2004), which has a buffer to store the input EDUs
to be analyzed and a stack to store the in-progress
subtrees, and defines the following action classes:
SHIFT, RIGHT-ARC-l, and LEFT-ARC-l. We
decode the dependency structure $y^*$ using a greedy
search algorithm, i.e., taking the action $a^*$ that is
valid and the most probable at each decision step:
$a^* = \arg\max_a P(a \mid \sigma)$, where $\sigma$ denotes the
parsing configuration.
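To make the greedy arc-standard decoding concrete, here is a minimal sketch. The scoring callable `prob` is a hypothetical stand-in for the neural action classifier and returns a dictionary from actions to probabilities; the validity rules shown (no arc without two stack items, the root never becomes a dependent) are the standard arc-standard constraints and are our assumption about what "valid" means here.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (action type, relation label); label is "" for SHIFT
Dependency = Tuple[int, int, str]  # (head, dependent, relation)


def valid_actions(stack: List[int], buffer: List[int], labels: List[str]) -> List[Action]:
    actions: List[Action] = []
    if buffer:
        actions.append(("SHIFT", ""))
    if len(stack) >= 2:
        # RIGHT-ARC-l: the stack top becomes a dependent of the item below it.
        actions += [("RIGHT-ARC", label) for label in labels]
        if stack[-2] != 0:
            # LEFT-ARC-l: the second item becomes a dependent of the stack top.
            # The root node e_0 (index 0) can never be a dependent.
            actions += [("LEFT-ARC", label) for label in labels]
    return actions


def greedy_parse(
    n: int,
    labels: List[str],
    prob: Callable[[List[int], List[int]], Dict[Action, float]],
) -> List[Dependency]:
    stack, buffer = [0], list(range(1, n + 1))
    deps: List[Dependency] = []
    while buffer or len(stack) > 1:
        scores = prob(stack, buffer)  # P(a | sigma) for the current configuration
        candidates = valid_actions(stack, buffer, labels)
        kind, label = max(candidates, key=lambda a: scores.get(a, 0.0))
        if kind == "SHIFT":
            stack.append(buffer.pop(0))
        elif kind == "RIGHT-ARC":
            dependent = stack.pop()
            deps.append((stack[-1], dependent, label))
        else:  # LEFT-ARC
            dependent = stack.pop(-2)
            deps.append((stack[-1], dependent, label))
    return deps
```

Each EDU is shifted once and attached once, so decoding takes exactly 2n actions for n EDUs. Under this sketch, a backward variant would amount to feeding the EDU indices in reverse order.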
Backward Shift-Reduce Model We expect that
different inductive biases can be introduced by
processing the document from the back. As the
third model option, we develop a backward variant
of the Shift-Reduce Model (B), which processes
the input sequence in the reverse order.
3.4 Confidence Measures
The key challenge in bootstrapping on out-of-domain
data is how to assess the reliability (or
usefulness) of the pseudo labels and how to select
an error-free and high-coverage subset. We define
confidence measures to assess the reliability of
the pseudo-labeled data. In Section 3.5, we define
selection criteria to filter out unreliable pseudo-labeled
data based on their confidence scores.
Model-based Confidence For the bootstrapping
methods that use a single teacher to generate
a pseudo-labeled set (i.e., ST, CT), we define
the confidence of the teacher model based on the
predictive probabilities of the decisions used to
build a parse tree. A discourse dependency structure
consists of a set (or series) of decisions.
Therefore, we use the average of the predictive
probabilities over the decisions.6 How to calculate
6 We also tested a model-based confidence measure
using the entropy of predictive probabilities, where we
replaced the predictive probability of a decision (e.g.,
$P(h^* \mid d)$) with the corresponding (negative) entropy, e.g.,
$-H(h \mid d) = \sum_{0 \le h \le n} P(h \mid d) \log P(h \mid d)$. Entropy has
been used especially in the active learning literature to calculate
data uncertainty (Li et al., 2016b; Kasai et al., 2019).
However, the predictive probabilities outperformed the entropy
counterparts consistently in our experiments. Thus,
we adopted the predictive probabilities for the model-based
confidence measure.
the model-based confidence measure $C(x, y)$
depends on the parsing models:
• Arc-Factored Model:
$$C(x, y) = \frac{1}{2n} \sum_{d=1}^{n} \{ P(h \mid d) + P(r \mid h, d) \},$$
where $(h, d, r) \in y$.
• Shift-Reduce Model, Backward Model:
$$C(x, y) = \frac{1}{|A(x, y)|} \sum_{(a, \sigma) \in A(x, y)} P(a \mid \sigma),$$
where $A(x, y)$ denotes the action and configuration
sequence to produce the parse tree $y$ for $x$.
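Both model-based measures are plain averages of decision probabilities, as the following sketch shows. The function names and the list-based inputs are illustrative assumptions: the probabilities are taken as already extracted from the parser for the chosen heads, relations, or actions.

```python
from typing import List


def arc_factored_confidence(head_probs: List[float], rel_probs: List[float]) -> float:
    """C(x, y) = 1/(2n) * sum_d [P(h | d) + P(r | h, d)].

    head_probs[d-1] is P(h | d) for the predicted head of dependent d, and
    rel_probs[d-1] is P(r | h, d) for the predicted relation.
    """
    n = len(head_probs)
    return sum(ph + pr for ph, pr in zip(head_probs, rel_probs)) / (2 * n)


def shift_reduce_confidence(action_probs: List[float]) -> float:
    """C(x, y) = average of P(a | sigma) over the executed action sequence."""
    return sum(action_probs) / len(action_probs)
```

For example, head probabilities `[0.9, 0.5]` with relation probabilities `[0.7, 0.5]` give (0.9 + 0.7 + 0.5 + 0.5) / 4 = 0.65.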
Agreement-based Confidence For the bootstrapping
methods that use multiple teachers to
generate a pseudo-labeled set (i.e., TT, AT), we
use the agreement level between the two teacher
models as the confidence for the pseudo-labeled
data. Specifically, we compute the rate of labeled
dependencies agreed between two predicted
structures, $y_i$ and $y_j$, as follows:
$$C(x, y_i, y_j) = \frac{1}{n} \sum_{d=1}^{n} \mathbb{1}\left[ h^i_d = h^j_d \wedge r^i_d = r^j_d \right],$$
where $\mathbb{1}[\cdot]$ is the indicator function, and $h^i_d$ and $r^i_d$
denote the head and the discourse relation class
for the dependent $d$ in $y_i$, respectively. It is worth
noting that both $y_i$ and $y_j$ have the same number
of dependencies, $n$. The higher the agreement rate
is, the more correct dependencies are expected
to be included.
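The agreement rate can be computed directly from two predicted structures. In this sketch, each structure is assumed to be a list of `(h, d, r)` triples covering the same dependents; the function name is illustrative.

```python
from typing import List, Tuple

Dependency = Tuple[int, int, str]  # (head, dependent, relation)


def agreement_confidence(y_i: List[Dependency], y_j: List[Dependency]) -> float:
    """Fraction of dependents whose head AND relation match in both trees."""
    by_dep_i = {d: (h, r) for h, d, r in y_i}
    by_dep_j = {d: (h, r) for h, d, r in y_j}
    n = len(by_dep_i)
    agreed = sum(1 for d, hr in by_dep_i.items() if by_dep_j.get(d) == hr)
    return agreed / n
```

Because the comparison is on labeled dependencies, two trees that attach an EDU to the same head with different relations count as a disagreement for that EDU.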
3.5 Sample Selection Criteria
Inspired by Steedman et al. (2003a), we define
two kinds of sample selection criteria, which
focus on the reliability (i.e., accuracy)
and the usefulness (i.e., training utility) of the
data, respectively.
Rank-above-k This is a reliability-oriented selection
criterion. We keep only the top N × k
samples with the highest confidence scores, where N
is the number of candidate pseudo-labeled data,
and k ∈ [0.0, 1.0]. Specifically, we first rank
the candidate pseudo-labeled data based on the
teacher-side confidence scores, and then we select
a subset that satisfies $R(x) \le N \times k$, where
$R(x) \in [1, N]$ denotes the ranking of $x$.

COVID19-DTB             SciDTB
Root                    Root
Elaboration             Elaboration, Progression, Summary
Comparison              Contrast, Comparison
Cause-Result            Cause-Effect, Explain
Condition               Condition
Temporal                Temporal
Joint                   Joint
Enablement              Enablement
Manner-Means            Manner-Means
Attribution             Attribution
Background              Background
Findings                Evaluation
Textual-Organization    –
Same-Unit               Same-Unit
Table 1: Discourse relation classes in COVID19-DTB
and their correspondences with SciDTB's
classes.
Rank-diff-k This is a utility-oriented selection
criterion. In contrast to Rank-above-k, which relies
only on the teacher-side confidence, this
criterion utilizes both the teacher-side and the
student-side confidence scores. This criterion
retains the pseudo-labeled data whose relative ranking
on the teacher side is higher than the relative
ranking on the student side by a margin k or more.
Specifically, after ranking the candidates independently
for each side, we compute the gap of the
relative rankings on the two sides, and then select
a subset that meets $R_{teacher}(x) + k \le R_{student}(x)$.
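The two selection criteria, Rank-above-k and Rank-diff-k, can be sketched as follows. Rank 1 denotes the most confident candidate; ties are broken by input order, which is an assumption of this sketch rather than something the text specifies. All function names are illustrative.

```python
from typing import Any, List


def ranks(confidences: List[float]) -> List[int]:
    """Return R(x) for each candidate: 1 = highest confidence."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    result = [0] * len(confidences)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def rank_above_k(samples: List[Any], teacher_conf: List[float], k: float) -> List[Any]:
    """Keep candidates with R(x) <= N * k, where k is a ratio in [0, 1]."""
    limit = len(samples) * k
    return [s for s, r in zip(samples, ranks(teacher_conf)) if r <= limit]


def rank_diff_k(
    samples: List[Any],
    teacher_conf: List[float],
    student_conf: List[float],
    k: int,
) -> List[Any]:
    """Keep candidates with R_teacher(x) + k <= R_student(x)."""
    r_teacher, r_student = ranks(teacher_conf), ranks(student_conf)
    return [s for s, rt, rs in zip(samples, r_teacher, r_student) if rt + k <= rs]
```

Note that k is a ratio for Rank-above-k (e.g., 0.6) but a rank margin for Rank-diff-k (e.g., 100). Rank-diff-k favors samples the teacher is relatively sure about while the student is relatively unsure, which is where the student has the most to learn.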
4 COVID19-DTB
We release a new discourse dependency tree-
bank for scholarly paper abstracts on COVID-19
and related coronaviruses like SARS and MERS
in order to test unsupervised domain adaptation
of discourse dependency parsing. We name our
new treebank COVID-19 Discourse Dependency
Treebank (COVID19-DTB).
4.1 Construction
We followed the RST-DT annotation guideline
(Carlson and Marcu, 2001) for EDU segmentation.
Based on SciDTB and Penn Discourse
Treebank (PDTB) (Prasad et al., 2008), we defined
14 discourse relation classes shown in Table 1.
We carefully analyzed the annotation data of
SciDTB and found that some classes are hard
to discriminate even for humans, which can lead
                              COVID19-DTB   SciDTB
Total number of documents     300           1045 (unique: 798)
Total number of EDUs          6005          15723
Avg. number of EDUs / doc     20.0          15.0
Avg. dependency distance      2.7           2.5
Max. dependency distance      38            26
Avg. Root position            6.6           3.9
Table 2: Dataset statistics for the COVID19-DTB
and SciDTB datasets.
to undesirable inconsistencies in the new dataset.
Thus, we have merged some classes, such as
Cause-Effect + Explain → Cause-Result. Some
classes are also renamed from SciDTB to fit
the biomedical domain, such as Evaluation →
Findings.
First, we sampled 300 abstracts randomly from
the September 2020 snapshot of the COVID-19
Open Research Dataset (CORD-19) (Wang et al.,
2020), which contains over 500,000 scholarly articles
on COVID-19 and related coronaviruses like
SARS and MERS. Next, the 300 abstracts were
segmented into EDUs manually by the authors.
Then, we employed two professional annotators
to give gold discourse dependency structures to the
300 abstracts. The annotators were trained using
a few examples and a manual guideline, and then
they annotated the 300 abstracts independently.7
We divided the results into development and test
splits, each of which consists of 150 examples.
4.2 Corpus Statistics
Table 2 and Figure 3 show the statistics and the
discourse relation distribution of COVID19-DTB.
We also show the statistics and the distribution
of SciDTB for comparison. We mapped discourse
relations in SciDTB to the corresponding classes in
COVID19-DTB. We removed the Root relations
in computing the proportions.
The average number of EDUs per document
is 20.0 in COVID19-DTB and 15.0 in SciDTB.
Although the average dependency distances in the
two corpora are almost the same (2.7 vs. 2.5),
the maximum dependency distance of COVID19-DTB
is considerably longer than that of SciDTB (38 vs. 26).
In addition, the average position of Root's direct
dependent is located further back in COVID19-DTB
(6.6 vs. 3.9). Although the overall discourse
7 The inter-annotator agreement is thus not calculated in
the current version of the dataset. Instead, we had several
discussions with each annotator to maintain the annotation
consistency at a satisfactory level.
Figure 3: Distributions of discourse relation classes
in COVID19-DTB and SciDTB. Discourse relations in
SciDTB are mapped to the corresponding classes in
COVID19-DTB.
relation distributions look similar, the propor-
tions of Elaboration and Same-Unit are larger in
COVID19-DTB. These differences reflect the fact
that biomedical abstracts tend to be longer, have
more complex sentences with embedded clauses,
and contain more detailed information, suggesting
the difficulty of discourse parser adaptation across
the two domains.
5 Experimental Setup
Datasets We evaluated the bootstrapping methods
on two UDA scenarios. The first setup was
a monologue adaptation between scientific disciplines:
NLP and biomedicine (especially on
COVID-19), which is an important scenario
because there is still no text-level discourse
training split of SciDTB (Yang and Li, 2018)
as the labeled source dataset, which contains 742
manual discourse dependency structures on the abstracts
in ACL Anthology. We also used the September 2020
snapshot of CORD-19 (Wang et al.,
2020) as the unlabeled target dataset, which contains
about 76,000 biomedical abstracts. We used
the development and test splits of COVID19-DTB
for validation and testing, respectively.
The discourse relation labels in the SciDTB
training set were mapped to the corresponding
classes of COVID19-DTB. We mapped Textual-
Organization relations in COVID19-DTB to Elab-
oration, because there is no corresponding class
in SciDTB. We also mapped Temporal relations
in the two datasets to Condition to reduce the
significant class imbalance.
The second setup was an adaptation across dialogue
genres, that is, dialogues in a multi-party
game and dialogues in Ubuntu Forum. We used
the training split of STAC (Asher et al., 2016)
as the labeled source dataset, which contains 887
manually labeled discourse dependency structures
on multi-party dialogues in the game, The Settlers
of Catan. We also used the Ubuntu Dialogue
Corpus (Lowe et al., 2015) as the unlabeled target
dataset, which contains dialogues extracted
from the Ubuntu chat logs. We retained dialogues
with 7-16 utterances and 2-9 speakers. We also
removed dialogues with long utterances (more
than 20 words). Finally, we obtained approximately
70,000 dialogues. We used the development
and test splits of Molweni (Li et al., 2020)
for validation and testing. Each split contains 500
manually labeled discourse dependency structures
on multi-party dialogues derived from the Ubuntu
Dialogue Corpus.
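The dialogue filtering rules described above (7-16 utterances, 2-9 speakers, no utterance longer than 20 words) can be expressed as a simple predicate. This is a sketch under our own assumptions: dialogues are represented as lists of `(speaker, text)` pairs, and words are counted by whitespace splitting, which may differ from the tokenization actually used.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker, text)


def keep_dialogue(
    utterances: List[Utterance],
    min_utts: int = 7,
    max_utts: int = 16,
    min_speakers: int = 2,
    max_speakers: int = 9,
    max_words: int = 20,
) -> bool:
    """Return True if the dialogue passes all three filtering rules."""
    if not (min_utts <= len(utterances) <= max_utts):
        return False
    speakers = {speaker for speaker, _ in utterances}
    if not (min_speakers <= len(speakers) <= max_speakers):
        return False
    # Assumption: word count by whitespace tokenization.
    return all(len(text.split()) <= max_words for _, text in utterances)
```

Applying such a predicate to the raw chat logs, e.g. `[d for d in dialogues if keep_dialogue(d)]`, would yield the retained subset.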
The unlabeled target documents in both se-
tups were segmented into EDUs using a publicly
available EDU segmentation tool (Wang et al.,
2018).
Evaluation We employed the traditional eval-
uation metrics in dependency parsing literature,
namely, Labeled Attachment Score (LAS) and
Unlabeled Attachment Score (UAS). We also used
Root Accuracy (RA), which indicates how well a
system can identify the most representative EDU
in the document (i.e., the dependent of the special
root node).
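The three metrics can be sketched as below. Two points are our assumptions rather than statements from the text: LAS and UAS are micro-averaged over all dependencies in the evaluation set, and RA is computed per document as an exact match between the gold and predicted sets of root dependents.

```python
from typing import Dict, List, Tuple

Dependency = Tuple[int, int, str]  # (head, dependent, relation)


def evaluate(gold_docs: List[List[Dependency]], pred_docs: List[List[Dependency]]) -> Dict[str, float]:
    total = uas_hits = las_hits = root_hits = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_map = {d: (h, r) for h, d, r in gold}
        pred_map = {d: (h, r) for h, d, r in pred}
        for d, (h, r) in gold_map.items():
            pred_h, pred_r = pred_map.get(d, (None, None))
            total += 1
            uas_hits += int(pred_h == h)                  # head matches
            las_hits += int(pred_h == h and pred_r == r)  # head and relation match
        # Root dependents are the EDUs attached to the special root node e_0.
        gold_roots = {d for d, (h, _) in gold_map.items() if h == 0}
        pred_roots = {d for d, (h, _) in pred_map.items() if h == 0}
        root_hits += int(gold_roots == pred_roots)
    return {
        "LAS": las_hits / total,
        "UAS": uas_hits / total,
        "RA": root_hits / len(gold_docs),
    }
```

By construction LAS can never exceed UAS, since a labeled match requires the head to match first.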
Implementation Details As the pre-trained
transformer encoders, we used SciBERT (Beltagy
et al., 2019) and SpanBERT (Joshi et al., 2019)
in the first and second adaptation setups, respectively.
The dimensionality of the MLPs in the
arc-factored model and the shift-reduce models
is 100 and 128, respectively. We used AdamW
and Adam optimizers for optimizing the transformer's
parameters ($\theta_{bert}$) and the task-specific
parameters ($\theta_{task}$), respectively, following Joshi
et al. (2019). We first trained the base models
on the labeled source dataset using the following
hyper-parameters: batch size = 1, learning rate
(LR) for $\theta_{bert}$ = 2e−5, LR for $\theta_{task}$ = 1e−4, warmup
steps = 2.4K. Then, we ran the bootstrapping methods
using the models with: batch size = 1, LR for
$\theta_{bert}$ = 2e−6, LR for $\theta_{task}$ = 1e−5, warmup steps =
7K. We trained all approaches for a maximum of
40 epochs. We applied early stopping when the
validation LAS did not increase for 10 epochs.
                                 Abstracts            Dialogues
Method            Selection      LAS    UAS    RA     LAS    UAS
Source-only (A)   -              61.3   74.8   82.0   29.9   55.1
Source-only (S)   -              61.8   74.5   78.0   33.2   66.1
Source-only (B)   -              60.0   72.9   78.0   29.2   55.6
ST (A ← A)        above-0.6      65.8   78.7   88.7   34.7   60.6
ST (S ← S)        above-0.6      65.3   76.9   84.7   37.9   67.4
CT (A ← S)        above-0.6      66.2   78.1   86.0   38.0   64.8
CT (S ← A)        above-0.6      66.1   78.2   86.0   39.1   64.4
CT (A ← S)        diff-100       66.0   78.3   88.0   38.5   66.5
CT (S ← A)        diff-100       66.2   78.8   84.7   39.5   66.0
CT (S ← B)        above-0.6      65.3   76.8   84.0   38.1   67.2
CT (B ← S)        above-0.6      65.6   76.9   87.3   38.5   67.4
CT (S ← B)        diff-100       65.5   76.8   86.0   39.1   67.5
CT (B ← S)        diff-100       65.5   76.6   86.7   39.2   67.7
TT (A ← S, B)     above-0.6      65.9   78.5   87.3   38.5   66.6
TT (S ← A, B)     above-0.6      65.9   78.4   86.0   39.1   66.7
TT (A ← S, B)     diff-100       65.4   77.4   86.7   38.6   66.8
TT (S ← A, B)     diff-100       65.1   77.7   87.3   38.9   66.5
AT (A ← S, B)     above-0.6      64.9   77.3   85.3   36.9   66.7
AT (S ← A, B)     above-0.6      65.3   77.4   88.7   38.6   63.2
AT (A ← S, B)     diff-100       65.3   77.6   84.7   36.9   65.7
AT (S ← A, B)     diff-100       64.6   77.6   85.3   38.2   61.9
Table 3: LAS, UAS, and RA for methods with and without
bootstrapping in the two UDA setups. Arrows indicate
the teacher and student models: e.g., TT
(S ← A, B) shows the test performance of the
Shift-Reduce Model (S) that is trained with the
Arc-Factored Model (A) and the Backward Shift-Reduce
Model (B) using Tri-Training (TT). RA is
omitted for the dialogue adaptation setup because
the accuracy is nearly 100% for most systems.
6 Results and Discussion
6.1 Effectiveness
We verified the effectiveness of bootstrapping
methods on the two UDA scenarios. We evalu-
ated the source-only models, which were trained
only on the labeled source dataset, as the baseline.
Table 3 shows the results. The bootstrapping
methods consistently improved performance
regardless of the adaptation scenario. The best
systems were CT (A ← S) with Rank-above-0.6
and CT (S ← A) with Rank-diff-100, which
outperformed the source-only systems (e.g.,
Source-only (S)) by more than 4.4 LAS points on the
monologue setup (NLP → COVID-19) and by
more than 6.3 LAS points on the dialogue setup
(Game → Ubuntu Forum), respectively. CT, TT,
and AT tended to achieve higher accuracy than
ST, particularly in the dialogue adaptation setup.
These results indicate that bootstrapping is signif-
icantly and consistently effective for UDA of dis-
course dependency parsing in various adaptation
scenarios, and that employing multiple models
| Method | Selection | Abstracts LAS | Abstracts UAS | Abstracts RA |
|---|---|---|---|---|
| CT (A ← S) | above-0.6 | 66.2 | 78.1 | 86.0 |
| CT (S ← A) | above-0.6 | 66.1 | 78.2 | 86.0 |
| CT (A ← A) | above-0.6 | 65.5 | 77.8 | 76.0 |
| CT (S ← S) | above-0.6 | 65.5 | 77.9 | 86.0 |

Table 4: Comparison of co-training systems
employing different types or the same type of
parsing models.
labeled data. Here, we analyzed the confidence
measures.
Confidence Scores Correlate with Quality
Regardless of the selection criteria, data with high
confidence scores tend to be selected. Figure 5
(a) shows the relationship between the confidence
scores and the parsing quality (LAS) in the
target domain. Specifically, we calculated the
confidence score of each example in the COVID19-DTB
test set, sorted the examples in descending
order of their confidence scores, and evaluated
LAS for each top-k% subset. We confirmed
that the confidence scores were roughly correlated
with the parsing quality, and the candidates
with higher confidence tended to be more accurate
than those with lower confidence. For example, if
we restricted the test data to the top 10% with the
highest confidence scores, the LAS of CT (A ←
S) with Rank-above-0.6 was 76.8%, which is
much higher than the LAS of this system on the
full test set (i.e., 66.2%).
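The top-k% analysis described above can be sketched as follows. This is a minimal illustration, assuming each test document comes with a confidence score plus the number of correctly labeled arcs and the total number of arcs; the function and argument names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def las_at_top_k(confidences: Sequence[float],
                 n_correct: Sequence[int],
                 n_total: Sequence[int],
                 percents: Sequence[int] = (10, 20, 50, 100)) -> List[Tuple[int, float]]:
    """LAS (in percent) restricted to the top-k% most confident documents.

    confidences[i]: confidence score of document i
    n_correct[i]:   arcs of document i with correct head and label
    n_total[i]:     all arcs of document i
    """
    # Sort document indices in descending order of confidence.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    results = []
    for k in percents:
        # Keep at least one document so the ratio is always defined.
        keep = order[:max(1, math.ceil(len(order) * k / 100))]
        correct = sum(n_correct[i] for i in keep)
        total = sum(n_total[i] for i in keep)
        results.append((k, 100.0 * correct / total))
    return results
```

If confidence tracks quality, the returned LAS values should tend to fall as k grows toward 100, where the value equals the LAS on the full set.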
Confident (Accurate) Pseudo Labels are Biased
Next, we examined what kinds of documents are
assigned higher confidence scores. Figure 5
(b) shows the relationship between the confidence
scores and the document length (i.e., the number of
EDUs). We found a strong correlation between
them: Documents with higher confidence scores
are biased toward shorter documents. This bias did not
depend on the confidence measure (model-based
vs. agreement-based), the sample selection criteria,
or even the presence of bootstrapping. Based
on these results, we can conjecture that longer
documents tend to receive pseudo labels of poor quality
(low confidence) and are less likely to be included in the
selected pseudo-labeled set ˜Lt. This conjecture
further implies that the current bottleneck of the
Figure 4: How bootstrapping methods improve performance
as a function of document length (i.e., the
number of EDUs) in the target domain.
is particularly effective in reducing the unintended
tendency of ST to amplify its own errors.
Next, we further analyzed in what kinds of documents
the bootstrapping system is particularly
effective. We divided the COVID19-DTB test set
into bins by the number of EDUs in each document
(n ≤ 15, 15 < n ≤ 30, n > 30), and for
each bin we examined the percentage of examples
improved by the bootstrapping systems over the
source-only system. Figure 4 shows the results.
When the document length was 10 or shorter, there
was no improvement in most examples; however,
when the length was longer than 30, the percentage
jumped to around 80% with CT, TT,
and AT. These results indicate that the longer the
document (or perhaps the higher the document
complexity) in the target domain, the greater the
benefit of bootstrapping.
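The binning analysis can be sketched in a few lines. The bin boundaries follow the text; counting a document as "improved" when its per-document LAS is strictly higher with bootstrapping than with the source-only model is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def length_bin(n_edus: int) -> str:
    """Assign a document to one of the three length bins used in the text."""
    if n_edus <= 15:
        return "n<=15"
    if n_edus <= 30:
        return "15<n<=30"
    return "n>30"


def percent_improved(n_edus: Sequence[int],
                     source_only_las: Sequence[float],
                     bootstrapped_las: Sequence[float]) -> Dict[str, float]:
    """Percentage of documents per bin whose LAS improved with bootstrapping.

    ASSUMPTION: "improved" means a strictly higher per-document LAS.
    """
    improved = defaultdict(int)
    total = defaultdict(int)
    for n, base, boot in zip(n_edus, source_only_las, bootstrapped_las):
        b = length_bin(n)
        total[b] += 1
        improved[b] += int(boot > base)
    return {b: 100.0 * improved[b] / total[b] for b in total}
```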
We also investigated the importance of employing
different types of parsing models in CT. The
theoretical importance of employing models with
different views (or inductive biases) in CT has
been discussed (Blum and Mitchell, 1998; Abney,
2002; Zhou and Goldman, 2004). We trained base
models with the same neural architecture but with
different initial parameters on the labeled source
dataset, and then retrained them using CT. We can
see from the results in Table 4 that the LAS of CT
using different model types is consistently higher
than that of CT with the same model type,
suggesting empirically that it is effective to employ
different model types in bootstrapping.
6.2 Analysis of Confidence Measures
One of the key challenges in bootstrapping for
UDA is to assess the true reliability of pseudo-
Figure 5: Relationships between the confidence scores and the parsing quality (LAS) or document length. We
used the examples in the COVID19-DTB test set. The examples are sorted in descending order of the confidence
scores.
The Problem is Not Sampling, but Prediction
To alleviate this low-coverage issue, we modified
the confidence measures defined in Subsection
3.4 to select longer documents more aggressively
for the pseudo-labeled set ˜Lt. We simply omitted
the averaging calculation over the decisions.
However, as shown in Figure 6 (see the results
with "w/o avg."), the performance degradation
tendency for longer documents did not change.
This fact indicates that the current bottleneck
is not the low coverage of the selected
pseudo-labeled set ˜Lt, but the low coverage of accurate
supervision in the candidate pseudo-labeled data
pool, that is, Lt.
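Why dropping the averaging step favors longer documents can be seen in a toy sketch. The exact confidence measures are defined in Subsection 3.4 and are not reproduced here; the code below merely assumes that a document-level score is a mean over per-decision probabilities, so that removing the average leaves a sum that grows with the number of decisions. The function name and the two toy documents are hypothetical.

```python
from typing import Sequence


def doc_confidence(decision_probs: Sequence[float], average: bool = True) -> float:
    """Document-level confidence from per-decision probabilities.

    ASSUMPTION: the measure is a mean over parsing decisions. With
    average=False the score is the plain sum, which grows with length.
    """
    if not decision_probs:
        raise ValueError("a document needs at least one parsing decision")
    total = sum(decision_probs)
    return total / len(decision_probs) if average else total


# Hypothetical documents: a short, confidently parsed one and a long,
# less confidently parsed one.
short_doc = [0.9, 0.9, 0.9]
long_doc = [0.6] * 20

# With averaging the short document ranks first; without it the long one does.
ranked_with_avg = doc_confidence(short_doc) > doc_confidence(long_doc)
ranked_without_avg = (doc_confidence(short_doc, average=False)
                      > doc_confidence(long_doc, average=False))
```

The ranking flips, so more long documents enter the selected set, but their pseudo labels are no more accurate than before, which is consistent with the finding that the bottleneck lies in prediction rather than in sampling.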
6.3 Analysis of Selection Criteria
Another important challenge in bootstrapping for
UDA is to select an error-free and high-coverage
subset from the candidate pseudo-labeled data
pool. Here, we analyzed the selection criteria.
There Is a Reliability-Coverage Trade-off
Varying the parameter k of Rank-above-k and
Rank-diff-k, we examined the final parsing quality
and the average number of selected pseudo-labeled
data (out of 5K candidates). We trained and
evaluated CT (A ← S) on the COVID19-DTB test
set. Figure 7 (lines with circle markers) shows
the results. Rank-above-k achieved slightly
higher performance than Rank-diff-k. However,
Rank-diff-k achieved its best performance with
fewer pseudo-labeled data. More interestingly, we
confirmed that, for both criteria, there is a trade-off
Figure 6: Relationships between the document length
and the parsing quality (LAS). Documents are sorted
in descending order of the document length.
bootstrapping systems is the low coverage of the
selected pseudo-labeled set ˜Lt.
Low Coverage of Accurate Pseudo Labels
Based on the above conjecture, it is natural to
expect that there is too little accurate supervision
for longer documents in the selected
pseudo-labeled set, and that the parsing accuracy of the
bootstrapping systems drops especially for longer
documents. Figure 6 shows the relationship
between the parsing quality and the document length.
The use of bootstrapping methods improved the
overall performance over the source-only systems;
however, regardless of the bootstrapping type,
the performance dropped significantly for longer
documents. These results confirm the shortage of
accurate supervision for longer documents in the
selected pseudo-labeled set.
Figure 7: Impacts of the sample selection criteria (Rank-above-k, Rank-diff-k) for different parameters k. We also
show the results when the number of selected pseudo-labeled data is adjusted to 3K.
between reliability (precision) and coverage (recall)
in the selected pseudo-labeled set: When k
was too strict (i.e., when k was too small for
Rank-above-k, and when the margin k was too
large for Rank-diff-k), the number of selected
pseudo-labeled data was too small, resulting in
lower LAS. When k was relaxed to some extent,
the number of selected pseudo-labeled data
increased and the LAS reached its highest value.
However, if k was relaxed further, the accuracy
decreased from the highest point due to
contamination by too much noisy supervision.
Figure 8: Impacts of the number of unlabeled target
documents.
Quantity Is Not the Only Issue Next, we
evaluated the sample selection criteria without
the influence of the number of selected
pseudo-labeled data. We selected the same number
of pseudo-labeled data (set to 3K) across
different k by adjusting the number of
unlabeled samples (set to 5K in the previous
experiments) appropriately. For example, to
select 3K pseudo-labeled data with Rank-above-0.2,
we sampled 15K unlabeled data at each
bootstrapping round. Figure 7 (lines with rectangle
markers) shows the results. In the region where k
was strict, the LAS curves improved compared to
before the adjustment because the number of
selected pseudo-labeled data increased. Meanwhile,
in the region where k was relaxed, the LAS curves
decreased or stayed flat because the selection
size decreased. More interestingly, even with
this adjusted setting, the strictest k was not the best
parameter. These results indicate that, although the
number of selected pseudo-labeled data is important,
it is not the only factor that determines the
optimal parameter k, and that it is still difficult to
identify truly useful pseudo-labeled data based on
these sample selection criteria alone.
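The pool-size adjustment is simple arithmetic once Rank-above-k is read as "keep the top fraction k of the candidates by confidence", which is what the figures in this section imply (0.6 × 5K = 3K and 0.2 × 15K = 3K). A minimal sketch under that reading follows; the function names are hypothetical, and Rank-diff-k is not covered because its definition is given elsewhere in the paper.

```python
import math
from typing import List, Sequence


def rank_above_k(confidences: Sequence[float], k: float) -> List[int]:
    """Indices of the top fraction k of candidates, ranked by confidence.

    ASSUMPTION: Rank-above-k keeps the k * N most confident pseudo-labeled
    documents out of N candidates (k in (0, 1]).
    """
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    n_keep = round(len(confidences) * k)
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    return order[:n_keep]


def pool_size_for(target_selected: int, k: float) -> int:
    """Number of unlabeled documents to sample per round so that
    Rank-above-k selects about `target_selected` of them."""
    # Rounding first guards against floating-point noise in the division.
    return math.ceil(round(target_selected / k, 6))
```

For instance, `pool_size_for(3000, 0.2)` gives the 15K pool mentioned in the text, and `pool_size_for(3000, 0.6)` gives the original 5K pool.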
6.4 Increasing the Unlabeled Dataset Size
So far, we have demonstrated that the current major
limitation of bootstrapping is the difficulty of
generating a diverse and accurate pseudo-labeled
data pool Lt. The most straightforward way of
mitigating this low-coverage problem is to
increase the number of unlabeled target documents.
Figure 8 shows that increasing the number of
unlabeled data with bootstrapping improved the
parsing quality. However, the quality improvement
saturated after 5K documents. These facts
demonstrate that the low-coverage problem cannot
be mitigated by simply adding more unlabeled
documents. We suspect this is because increasing
the diversity of unlabeled documents does
not always increase the diversity of accurately
pseudo-labeled data.
| Method | Confidence | LAS |
|---|---|---|
| Source-only (S) | - | 33.2 |
| CT (S ← A) w/ above-0.6 | - | 39.1 |
| Source-only (S) + AL | random | 45.2 |
| Source-only (S) + AL | model-based | 45.8 |
| CT (S ← A) w/ above-0.6 + AL | random | 45.9 |
| CT (S ← A) w/ above-0.6 + AL | model-based | 46.4 |
| CT (S ← A) w/ above-0.6 + AL | agreement-based | 46.3 |

Table 5: LAS for methods with and without active
learning (AL).
6.5 Active Learning
A more direct and promising solution than
increasing the unlabeled corpus size is to manually
annotate a small number of documents that the
bootstrapping system cannot analyze accurately.
We tested the potential effectiveness of active
learning (AL) (Settles, 2009). To emulate the AL
process, we used the Molweni training set (9K
dialogues) as the unlabeled target documents and
leveraged the gold annotation. We first measured
the confidence (or uncertainty) scores of
each unlabeled document using the source-only or
co-training systems that had already been trained
in the dialogue adaptation setup. Then, we sampled
the 100 documents with the worst confidence scores,
because such data are unlikely to be selected in
bootstrapping and accurately parsed. Finally, we
fine-tuned each model on the 100 actively labeled
data. We also used random confidence (uncertainty)
measures as the baseline, whose results are
averaged over 5 trials. Table 5 shows that, even
though only 100 dialogues were annotated manually,
AL improved the performance significantly,
which was difficult to achieve by bootstrapping
alone. Annotating highly uncertain data is more
effective than annotating randomly sampled
dialogues. We can also see that the combination
of bootstrapping and AL achieves higher
performance than the source-only model with AL,
suggesting that bootstrapping and AL can be
complementary and that bootstrapping is useful to
identify potentially useful data in AL. The
performance improvement could be further increased
by repeating bootstrapping and AL alternately,
which is worth investigating in the future.
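The selection step of the emulated AL procedure can be sketched as follows, assuming one confidence score per unlabeled document. The function names and the fixed seed are hypothetical; annotation and fine-tuning are outside the scope of the sketch.

```python
import random
from typing import List, Sequence


def least_confident(confidences: Sequence[float], budget: int = 100) -> List[int]:
    """Indices of the `budget` documents with the worst confidence scores,
    i.e., the ones to send for manual annotation."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:budget]


def random_selection(n_docs: int, budget: int = 100, seed: int = 0) -> List[int]:
    """Random baseline: sample `budget` document indices without replacement.
    Different seeds can stand in for the repeated trials of the baseline."""
    rng = random.Random(seed)
    return rng.sample(range(n_docs), min(budget, n_docs))
```

With a pool of 9K dialogues and a budget of 100, only about 1% of the pool is annotated, which is why the choice of which documents to label matters.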
6.6 Summary and Recommendations
Here, we summarize what we have learned from
the experiments and push the findings a step
further in order to provide practical guidelines for
out-of-domain discourse dependency parsing.
1. Bootstrapping improves out-of-domain discourse
dependency parsing significantly and
consistently in various adaptation scenarios.
In particular, we recommend co-training with
the arc-factored and shift-reduce models because
co-training tends to be more effective
and more efficient in training than the tri-training
variants.
2. A labeled source dataset that is as close as
possible to the target domain is preferable
to suppress the domain-shift problem. The
labeled source dataset should also follow
the same annotation framework as the
target domain (e.g., definitions of EDUs and
discourse relation classes).
3. It is reasonable to use the models’ predictive
probability as the confidence measure
to filter out noisy pseudo labels, because
confidence scores correlate with the accuracy
of the pseudo labels. In particular, we recommend
the Rank-above-k criterion because,
unlike Rank-diff-k, k is independent of the
number of unlabeled data. However, since
the accurately predicted pseudo labels are
biased towards simpler documents, the parsing
accuracy on more complex documents is
difficult to improve even with bootstrapping.
4. The low-coverage problem of pseudo labels
is not alleviated by increasing the number
of unlabeled target documents. We recommend
manually annotating a small number of
target documents using active learning and
combining it with bootstrapping.
7 Conclusion
In this paper, we investigated the effectiveness
and limitations of bootstrapping methods in
unsupervised domain adaptation of BERT-based
discourse dependency parsers. The results demonstrate
that bootstrapping is significantly
and consistently effective in various adaptation
scenarios. However, regardless of the tuned confidence
measures and sample selection criteria, the
bootstrapping methods have difficulty generating
both diverse and accurate pseudo labels, which is
the current limiting factor in further improvement.
This low-coverage problem cannot be mitigated
by just increasing the unlabeled corpus size. We
confirmed that active learning can be an
effective solution to this problem.
We have a limitation in this study: Our experiments
use only English documents. Although
bootstrapping and discourse parsing models are
language-independent at the algorithmic level,
experiments on domain adaptation require labeled
datasets in both the source and target domains for
training and evaluation. In order to investigate the
universality and language-dependent features of
the bootstrapping methods in various languages,
it is necessary to develop discourse treebanks in a
variety of languages.
In the future, we will expand the COVID19-DTB
dataset with additional biomedical abstracts
to facilitate the exploration and application of
discourse parsing technologies to biomedical
knowledge acquisition.
Acknowledgments
We would like to thank the action editor and three
anonymous reviewers for their thoughtful and
insightful comments, which we found very helpful
in improving the paper. This work was supported
by JSPS KAKENHI 21K17815. This work was
also partly supported by JST, AIP Trilateral AI
Research, grant number JPMJCR20G9, Japan.
References
Steven Abney. 2002. Bootstrapping. In Proceedings
of the 40th Annual Meeting of the Association
for Computational Linguistics (ACL).
https://doi.org/10.3115/1073083.1073143
Stergos Afantenos, Eric Kow, Nicholas Asher,
and Jérémy Perret. 2015. Discourse parsing for
multi-party chat dialogues. In Proceedings of
the 2015 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 928–937.
https://doi.org/10.18653/v1/D15-1109
Nicholas Asher, Julie Hunter, Mathieu Morey,
Farah Benamara, and Stergos Afantenos. 2016.
Discourse structure and dialogue acts in
multi-party dialogue: the STAC corpus. In Proceedings
of the 10th International Conference on
Language Resources and Evaluation (LREC),
pages 2721–2727.
Nicholas Asher and Alex Lascarides. 2003. Logics
of Conversation. Cambridge University Press.
Sonia Badene, Kate Thompson, Jean-Pierre Lorré,
and Nicholas Asher. 2019a. Data programming
for learning discourse structure. In Proceedings
of the 57th Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 640–645.
https://doi.org/10.18653/v1/P19-1061
Sonia Badene, Kate Thompson, Jean-Pierre Lorré,
and Nicholas Asher. 2019b. Weak supervision
for learning discourse structure. In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the
9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 2296–2305.
https://doi.org/10.18653/v1/D19-1234
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019.
SciBERT: A pretrained language model for
scientific text. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3615–3620.
https://doi.org/10.18653/v1/D19-1371
Parminder Bhatia, Yangfeng Ji, and Jacob
Eisenstein. 2015. Better document-level sentiment
analysis from RST discourse parsing. In
Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 2212–2218.
https://doi.org/10.18653/v1/D15-1263
Avrim Blum and Tom Mitchell. 1998. Combining
labeled and unlabeled data with co-training.
In Proceedings of the 11th Annual Conference
on Computational Learning Theory (COLT),
pages 92–100.
https://doi.org/10.1145/279943.279962
Lynn Carlson and Daniel Marcu. 2001. Discourse
tagging reference manual. Technical Report
ISI-TR-545. University of Southern California
Information Sciences Institute.
Lynn Carlson, Daniel Marcu, and Mary Ellen
Okurowski. 2001. Building a discourse-tagged
corpus in the framework of Rhetorical
Structure Theory. In Proceedings of the 2nd
SIGdial Workshop on Discourse and Dialogue.
https://doi.org/10.3115/1118078.1118083
Danqi Chen and Christopher D. Manning.
2014. A fast and accurate dependency parser
using neural networks. In Proceedings of
the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 740–750.
https://doi.org/10.3115/v1/D14-1082
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies (NAACL-HLT),
pages 4171–4186.
Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural
dependency parsing. In Proceedings of the
5th International Conference on Learning
Representations (ICLR).
Greg Durrett, Taylor Berg-Kirkpatrick, and
Dan Klein. 2016. Learning-based single-document
summarization with compression
and anaphoricity constraints. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 1998–2008.
https://doi.org/10.18653/v1/P16-1188
Vanessa Wei Feng and Graeme Hirst. 2014. A
linear-time bottom-up discourse parser with
constraints and post-editing. In Proceedings
of the 52nd Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 511–521.
Elisa Ferracane, Su Wang, and Raymond J.
Mooney. 2017. Leveraging discourse information
effectively for authorship attribution. In
Proceedings of the 8th International Joint
Conference on Natural Language Processing
(IJCNLP), pages 584–593.
Grigorii Guz and Giuseppe Carenini. 2020.
Coreference for discourse parsing: A neural
approach. In Proceedings of the First Workshop
on Computational Approaches to Discourse,
pages 160–167.
Hugo Hernault, Danushka Bollegala, and Mitsuru
Ishizuka. 2010a. A semi-supervised approach to
improve classification of infrequent discourse
relations using feature vector extension. In
Proceedings of the 2010 Conference on Empirical
Methods in Natural Language Processing
(EMNLP), pages 399–409.
Hugo Hernault, Helmut Prendinger, David A.
DuVerle, and Mitsuru Ishizuka. 2010b. HILDA:
A discourse parser using support vector machine
classification. Dialogue & Discourse,
1(3):1–33.
https://doi.org/10.5087/dad.2010.003
Tsutomu Hirao, Yasuhisa Yoshida, Masaaki
Nishino, Norihito Yasuda, and Masaaki Nagata.
2013. Single-document summarization as a
tree knapsack problem. In Proceedings of
the 2013 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 1515–1520.
Wenpeng Hu, Zhangming Chan, Bing Liu,
Dongyan Zhao, Jinwen Ma, and Rui Yan.
2019. GSN: A graph-structured network for
multi-party dialogues. In Proceedings of
the Twenty-Eighth International Joint Conference
on Artificial Intelligence (IJCAI),
pages 5010–5016.
Zhongqiang Huang and Mary Harper. 2009.
Self-training PCFG grammars with latent
annotations across languages. In Proceedings of
the 2009 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 832–841.
https://doi.org/10.3115/1699571.1699621
Patrick Huber and Giuseppe Carenini. 2019.
Predicting discourse structure using distant
supervision from sentiment. In Proceedings
of the 2019 Conference on Empirical Methods
in Natural Language Processing and the
9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 2306–2316.
https://doi.org/10.18653/v1/D19-1235
Ofir Israeli, Adi Beth-Din, Nir Paran, Dana
Stein, Shirley Lazar, Shay Weiss, Elad Milrot,
Yafit Atiya-Nasagi, Shmuel Yitzhaki, Orly
Laskar, and Ofir Schuster. 2020. Evaluating
the efficacy of RT-qPCR SARS-CoV-2 direct
approaches in comparison to RNA extraction.
BioRxiv preprint 2020.06.10.144196v1.
https://doi.org/10.1101/2020.06.10.144196
Peter Jansen, Mihai Surdeanu, and Peter Clark.
2014. Discourse complements lexical semantics
for non-factoid answer reranking. In Proceedings
of the 52nd Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 977–986.
https://doi.org/10.3115/v1/P14-1092
Yangfeng Ji and Jacob Eisenstein. 2014.
Representation learning for text-level discourse
parsing. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics (ACL), pages 13–24.
Yangfeng Ji and Noah A. Smith. 2017. Neural
discourse structure for text categorization. In
Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(ACL), pages 996–1005.
Kailang Jiang, Giuseppe Carenini, and
Raymond T. Ng. 2016. Training data enrichment
for infrequent discourse relations. In
Proceedings of the 26th International Conference
on Computational Linguistics (COLING),
pages 2603–2614.
Mandar Joshi, Danqi Chen, Yinhan Liu,
Daniel S. Weld, Luke Zettlemoyer, and Omer
Levy. 2019. SpanBERT: Improving pre-training
by representing and predicting spans.
Transactions of the Association for Computational
Linguistics, 8:64–77.
https://doi.org/10.1162/tacl_a_00300
Shafiq Joty, Giuseppe Carenini, and Raymond
T. Ng. 2015. CODRA: A novel discriminative
framework for rhetorical analysis. Computational
Linguistics, 41(3):385–435.
https://doi.org/10.1162/COLI_a_00226
Shafiq Joty, Giuseppe Carenini, Raymond T.
Ng, and Yashar Mehdad. 2013. Combining
intra- and multi-sentential rhetorical parsing
for document-level discourse analysis. In
Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics
(ACL), pages 486–496.
Jungo Kasai, Kun Qian, Sairam Gurajada,
Yunyao Li, and Lucian Popa. 2019. Low-resource
deep entity resolution with transfer
and active learning. In Proceedings of
the 57th Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 5851–5861.
https://doi.org/10.18653/v1/P19-1586
Eliyahu Kiperwasser and Yoav Goldberg. 2016.
Simple and accurate dependency parsing using
bidirectional LSTM feature representations.
Transactions of the Association for Computational
Linguistics, 4:313–327.
https://doi.org/10.1162/tacl_a_00101
Naoki Kobayashi, Tsutomu Hirao, Hidetaka
Kamigaito, Manabu Okumura, and Masaaki
Nagata. 2020. Top-down RST parsing utilizing
granularity levels in documents. In Proceedings
of the Thirty-Fourth AAAI Conference on
Artificial Intelligence (AAAI), pages 8099–8106.
https://doi.org/10.1609/aaai.v34i05.6321
Naoki Kobayashi, Tsutomu Hirao, Hidetaka
Kamigaito, Manabu Okumura, and Masaaki
Nagata. 2021. Improving neural RST parsing
model with silver agreement subtrees. In
Proceedings of the 2021 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies (NAACL-HLT), pages 1600–1612.
https://doi.org/10.18653/v1/2021.naacl-main.127
Naoki Kobayashi, Tsutomu Hirao, Kengo
Nakamura, Hidetaka Kamigaito, Manabu
Okumura, and Masaaki Nagata. 2019. Split or
merge: Which is better for unsupervised RST
parsing? In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 5797–5802.
https://doi.org/10.18653/v1/D19-1587
Fajri Koto, Jey Han Lau, and Timothy Baldwin.
2021. Top-down discourse parsing via sequence
labeling. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics (EACL),
pages 715–726.
https://doi.org/10.18653/v1/2021.eacl-main.60
Jiaqi Li, Ming Liu, Min-Yen Kan, Zihao Zheng,
Zekun Wang, Wenqiang Lei, Ting Liu, and
Bing Qin. 2020. Molweni: A challenge multiparty
dialogue-based machine reading comprehension
dataset with discourse structure. In
Proceedings of the 28th International Conference
on Computational Linguistics (COLING),
pages 2642–2652.
Jiwei Li, Rumeng Li, and Eduard Hovy. 2014a.
Recursive deep models for discourse parsing.
In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 2061–2069.
Qi Li, Tianshi Li, and Baobao Chang. 2016a.
Discourse parsing with attention-based hierarchical
neural networks. In Proceedings of
the 2016 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 362–371.
https://doi.org/10.18653/v1/D16-1035
Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie
Li. 2014b. Text-level discourse dependency
parsing. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics (ACL), pages 25–35.
Zhenghua Li, Min Zhang, Yue Zhang, Zhanyi
Liu, Wenliang Chen, Hua Wu, and Haifeng
Wang. 2016b. Active learning for dependency
parsing with partial annotation. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 344–354.
Yang Liu and Mirella Lapata. 2018. Learning
structured text representations. Transactions
of the Association for Computational
Linguistics, 6:63–75.
https://doi.org/10.1162/tacl_a_00005
Annie Louis, Aravind Joshi, and Ani Nenkova.
2010. Discourse indicators for content selection
in summarization. In Proceedings of the
SIGDIAL 2010 Conference, pages 147–156.
Ryan Lowe, Nissan Pow, Iulian Serban, and
Joelle Pineau. 2015. The Ubuntu Dialogue Corpus:
A large dataset for research in unstructured
multi-turn dialogue systems. In Proceedings of
the 16th Annual Meeting of the Special Interest
Group on Discourse and Dialogue (SIGDIAL),
pages 285–294.
William C. Mann and Sandra A. Thompson.
1988. Rhetorical Structure Theory: Towards
a functional theory of text organization.
Text - Interdisciplinary Journal for the Study of
Discourse, 8(3):243–281.
https://doi.org/10.1515/text.1.1988.8.3.243
Daniel Marcu. 1999. A decision-based approach
to rhetorical parsing. In Proceedings of the 37th
Annual Meeting of the Association for Computational
Linguistics (ACL), pages 365–372.
David McClosky, Eugene Charniak, and Mark
Johnson. 2006. Effective self-training for
parsing. In Proceedings of the 2006 Conference
of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies (NAACL-HLT),
pages 152–159.
Ryan McDonald, Fernando Pereira, Kiril Ribarov,
and Jan Hajič. 2005. Non-projective dependency
parsing using spanning tree algorithms.
In Proceedings of Human Language Technology
Conference and Conference on Empirical
Methods in Natural Language Processing
(EMNLP-HLT), pages 523–530.
https://doi.org/10.3115/1220575.1220641
Mathieu Morey, Philippe Muller, and Nicholas
Asher. 2018. A dependency perspective on RST
discourse parsing and evaluation. Computational
Linguistics, 44(2):197–235.
https://doi.org/10.1162/coli_a_00314
Philippe Muller, Stergos Afantenos, Pascal
Denis, and Nicholas Asher. 2012. Constrained
decoding for text-level discourse parsing. In
Proceedings of the 24th International Conference
on Computational Linguistics (COLING),
pages 1883–1900.
Noriki Nishida and Hideki Nakayama. 2020.
Unsupervised discourse constituency parsing
using Viterbi EM. Transactions of the
Association for Computational Linguistics,
8:215–230.
https://doi.org/10.1162/tacl_a_00312
Joakim Nivre. 2004. Incrementality in deterministic
dependency parsing. In Proceedings
of the ACL Workshop Incremental Parsing:
Bringing Engineering and Cognition Together,
pages 50–57.
https://doi.org/10.3115/1613148.1613156
Jérémy Perret, Stergos Afantenos, Nicholas Asher,
and Mathieu Morey. 2016. Integer linear
programming for discourse parsing. In Proceedings
of the 2016 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术
(NAACL-HLT), pages 99–109. https://土井
.org/10.18653/v1/N16-1013
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni
Miltsakaki, Livio Robaldo, Aravind Joshi, 和
Bonnie Webber. 2008. The Penn Discourse
TreeBank 2.0. In Proceedings of the Sixth In-
ternational Conference on Language Resources
and Evaluation.
Chris Quirk and Hoifung Poon. 2017. Distant
supervision for relation extraction beyond the
sentence boundary. In Proceedings of the
15th Conference of the European Chapter of
the Association for Computational Linguistics
(EACL), pages 1171–1182. https://doi
.org/10.18653/v1/E17-1110
Alexander J. Ratner, Christopher M. De Sa, Sen
Wu, Daniel Selsam, and Christopher Ré. 2016.
Data programming: Creating large training sets,
quickly. In Proceedings of the 30th International
Conference on Neural Information
Processing Systems (NIPS), pages 3574–3582.
Roi Reichart and Ari Rappoport. 2007. Self-training
for enhancement and domain adaptation
of statistical parsers trained on small
datasets. In Proceedings of the 45th Annual
Meeting of the Association for Computational
Linguistics (ACL), pages 616–623.
Sebastian Ruder and Barbara Plank. 2018. Strong
baselines for neural semi-supervised learning
under domain shift. In Proceedings of
the 56th Annual Meeting of the Association
for Computational Linguistics (ACL),
pages 1044–1054. https://doi.org/10
.18653/v1/P18-1096
Kenji Sagae. 2009. Analysis of discourse structure
with syntactic dependencies and data-driven
shift-reduce parsing. In Proceedings of the 11th
International Workshop on Parsing Technology
(IWPT), pages 81–84. https://doi.org
/10.3115/1697236.1697253
Kuniaki Saito, Toshitaka Ushiku, and Tatsuya
Harada. 2017. Asymmetric tri-training for un-
supervised domain adaptation. In Proceed-
ings of The 34th International Conference on
Machine Learning (ICML), pages 2988–2997.
Burr Settles. 2009. Active learning literature
survey. Computer Sciences Technical Report
1648. University of Wisconsin–Madison.
Zhouxing Shi and Minlie Huang. 2019. A
deep sequential model for discourse parsing
on multi-party dialogues. In Proceedings of
the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI), pages 7007–7014.
https://doi.org/10.1609/aaai.v33i01
.33017007
Anders Søgaard and Christian Rishøj. 2010.
Semi-supervised dependency parsing using
generalized tri-training. In Proceedings of the 23rd
International Conference on Computational
Linguistics (COLING), pages 1065–1073.
Mark Steedman, Rebecca Hwa, Stephen
Clark, Miles Osborne, Anoop Sarkar, Julia
Hockenmaier, Paul Ruhlen, Steven Baker,
and Jeremiah Crim. 2003a. Example selection
for bootstrapping statistical parsers. In
Proceedings of the 2003 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies (NAACL-HLT), pages 236–243.
https://doi.org/10.3115/1073445
.1073476
Mark Steedman, Miles Osborne, Anoop
Sarkar, Stephen Clark, Rebecca Hwa, Julia
Hockenmaier, Paul Ruhlen, Steven Baker, and
Jeremiah Crim. 2003b. Bootstrapping statistical
parsers from small datasets. In Proceedings of
the 10th Conference of the European Chapter
of the Association for Computational
Linguistics (EACL), pages 331–338. https://doi
.org/10.3115/1067807.1067851
Jun Suzuki and Hideki Isozaki. 2008. Semi-
supervised sequential labeling and segmentation
using giga-word scale unlabeled data. In
Proceedings of the 46th Annual Meeting of
the Association for Computational Linguistics
(ACL), pages 665–673.
Suzan Verberne, Lou Boves, Nelleke Oostdijk,
and Peter-Arno Coppen. 2007. Evaluating
discourse-based answer extraction for why-
question answering. In Proceedings of the 30th
Annual International ACM SIGIR Conference
on Research and Development in Information
Retrieval (SIGIR), pages 735–736. https://
doi.org/10.1145/1277741.1277883
Lucy Lu Wang, Kyle Lo, Yoganand
Chandrasekhar, Russell Reas, Jiangjiang Yang,
Douglas Burdick, Darrin Eide, Kathryn Funk,
Yannis Katsis, Rodney Kinney, Yunyao Li,
Ziyang Liu, William Merrill, Paul Mooney,
Dewey Murdick, Devvret Rishi, Jerry Sheehan,
Zhihong Shen, Brandon Stilson, Alex D.
Wade, Kuansan Wang, Nancy Xin Ru Wang,
Chris Wilhelm, Boya Xie, Douglas Raymond,
Daniel S. Weld, Oren Etzioni, and Sebastian
Kohlmeier. 2020. CORD-19: The COVID-19
open research dataset. arXiv preprint arXiv:
2004.10706v4.
Yizhong Wang, Sujian Li, and Houfeng Wang.
2017. A two-stage parsing method for text-level
discourse analysis. In Proceedings of the 55th
Annual Meeting of the Association for
Computational Linguistics (ACL), pages 184–188.
https://doi.org/10.18653/v1/P17
-2029
Yizhong Wang, Sujian Li, and Jingfeng Yang.
2018. Toward fast and accurate neural discourse
segmentation. In Proceedings of the
2018 Conference on Empirical Methods in
Natural Language Processing. https://doi
.org/10.18653/v1/D18-1116
David Weiss, Chris Alberti, Michael Collins,
and Slav Petrov. 2015. Structured training
for neural network transition-based parsing.
In Proceedings of the 53rd Annual Meeting
of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(ACL-IJCNLP), pages 323–333. https://
doi.org/10.3115/v1/P15-1032
Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing
Liu. 2020. Discourse-aware neural extractive
text summarization. In Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics (ACL), pages 5021–5031.
An Yang and Sujian Li. 2018. SciDTB: Discourse
dependency treebank for scientific abstracts.
In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(ACL), pages 444–449. https://doi.org
/10.18653/v1/P18-2071
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods. In
Proceedings of the 33rd Annual Meeting of
the Association for Computational Linguistics
(ACL), pages 189–196. https://doi.org
/10.3115/981658.981684
Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao,
and Masaaki Nagata. 2014. Dependency-based
discourse parser for single-document summarization.
In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language
Processing, pages 1834–1839. https://doi
.org/10.3115/v1/D14-1196
Liwen Zhang, Ge Wang, Wenjuan Han, and
Kewei Tu. 2021. Adapting unsupervised syntactic
parsing methodology for discourse
dependency parsing. In Proceedings of the
59th Annual Meeting of the Association
for Computational Linguistics and the 11th
International Joint Conference on Natural Language
Processing (Volume 1: Long Papers),
pages 5782–5794. https://doi.org/10
.18653/v1/2021.acl-long.449
Longyin Zhang, Yuqing Xing, Fang Kong,
Peifeng Li, and Guodong Zhou. 2020. A top-
down neural architecture towards text-level
parsing of discourse rhetorical structure. In
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics (ACL),
pages 6386–6395. https://doi.org/10
.18653/v1/2020.acl-main.569
Yan Zhou and Sally Goldman. 2004. Democratic
co-learning. In Proceedings of the 16th IEEE
International Conference on Tools with Artifi-
cial Intelligence (ICTAI).
Zhi-Hua Zhou and Ming Li. 2005. Tri-training:
Exploiting unlabeled data using three classi-
fiers. IEEE Transactions on Knowledge and
Data Engineering, 17(11):1529–1541. https://
doi.org/10.1109/TKDE.2005.186