VILA: Improving Structured Content Extraction from Scientific PDFs
Using Visual Layout Groups

Zejiang Shen1 Kyle Lo1 Lucy Lu Wang1 Bailey Kuehl1
Daniel S. Weld1,2 Doug Downey1,3

1Allen Institute for AI, USA 2University of Washington, USA 3Northwestern University, USA

{shannons,kylel,lucyw,baileyk,danw,dougd}@allenai.org

Abstract

Accurately extracting structured content from PDFs is a critical first
step for NLP over scientific papers. Recent work has improved extraction
accuracy by incorporating elementary layout information, for example,
each token’s 2D position on the page, into language model pretraining.
We introduce new methods that explicitly model VIsual LAyout (VILA)
groups, that is, text lines or text blocks, to further improve
performance. In our I-VILA approach, we show that simply inserting
special tokens denoting layout group boundaries into model inputs can
lead to a 1.9% Macro F1 improvement in token classification. In the
H-VILA approach, we show that hierarchical encoding of layout groups can
result in up to 47% inference time reduction with less than 0.8% Macro
F1 loss. Unlike prior layout-aware approaches, our methods do not
require expensive additional pretraining, only fine-tuning, which we
show can reduce training cost by up to 95%. Experiments are conducted on
a newly curated evaluation suite, S2-VLUE, that unifies existing
automatically labeled datasets and includes a new dataset of manual
annotations covering diverse papers from 19 scientific disciplines.
Pre-trained weights, benchmark datasets, and source code are available
at https://github.com/allenai/VILA.

1 Introduction

Scientific papers are usually distributed in Portable
Document Format (PDF) without extensive se-
mantic markup. Extracting structured document
representations from these PDF files—i.e., identi-
fying title and author blocks, figures, references,
and so on—is a critical first step for downstream
NLP tasks (Beltagy et al., 2019; Wang et al., 2020)
and is important for improving PDF accessibility
(Wang et al., 2021).


Recent work demonstrates that document layout information can be used
to enhance content extraction via large-scale, layout-aware pretraining
(Xu et al., 2020, 2021; Li et al., 2021). However, these methods only
consider individual tokens’ 2D positions and do not explicitly model
high-level layout structures like the grouping of text into lines and
blocks (see Figure 1 for an example), limiting accuracy. Further,
existing methods come with enormous computational costs: They rely on
further pretraining an existing pretrained model like BERT (Devlin
et al., 2019) on layout-enriched input, and achieving the best
performance from the models requires more than a thousand (Xu et al.,
2020) to several thousand (Xu et al., 2021) GPU-hours. This means that
swapping in a new pretrained text model or experimenting with new
layout-aware architectures can be prohibitively expensive, incompatible
with the goals of green AI (Schwartz et al., 2020).

In this paper, we explore how to improve the accuracy and efficiency of
structured content extraction from scientific documents by using VIsual
LAyout (VILA) groups. Following Zhong et al. (2019) and Tkaczyk et al.
(2015), our methods use the idea that a document page can be segmented
into visual groups of tokens (either lines or blocks), and that the
tokens within each group generally have the same semantic category,
which we refer to as the group uniformity assumption (see Figure 1(b)).
Given text lines or blocks generated by rule-based PDF parsers (Tkaczyk
et al., 2015) or vision models (Zhong et al., 2019), we design two
different methods to incorporate the VILA groups and the assumption
into modeling: The I-VILA model adds layout indicator tokens to textual
inputs to improve the accuracy of existing BERT-based language models,
while the H-VILA model uses VILA

Transactions of the Association for Computational Linguistics, vol. 10, pp. 376–392, 2022. https://doi.org/10.1162/tacl_a_00466
Action Editor: Kristina Toutanova. Submission batch: 8/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00466/2006993/tacl_a_00466.pdf by guest on 07 September 2023

The H-VILA model performs group-level
predictions and can reduce model inference
time by 47% with less than 0.8% loss in
Macro F1.

3. We construct a unified benchmark suite
S2-VLUE, which enhances existing datasets
with VILA structures, and introduce a novel
dataset S2-VL that addresses gaps in existing
resources. S2-VL contains hand-annotated
gold labels for 15 token categories on papers
spanning 19 disciplines.

The benchmark datasets, modeling code, and trained
weights are available at https://github.com/allenai/VILA.

2 Related Work

2.1 Structured Content Extraction for Scientific Documents

Prior work on structured content extraction for
scientific documents usually relies on textual or
visual features. Text-based methods like Scien-
ceParse (Ammar et al., 2018), GROBID (GRO,
2008–2021), or Corpus Conversion Service (Staar
et al., 2018) combine PDF-to-text parsing engines
like CERMINE (Tkaczyk et al., 2015) or pdfalto,1
which output a sequence of tokens extracted
from a PDF, with machine learning models like
RNN (Hochreiter and Schmidhuber, 1997), CRF
(Lafferty et al., 2001), or Random Forest (Breiman,
2001) trained to classify the token categories of
the sequence. Though these models are practical
and fairly efficient, they fall short in prediction
accuracy or generalize poorly to out-of-domain
documents. Vision-based approaches (Zhong
et al., 2019; He et al., 2017; Siegel et al., 2018),
on the other hand, treat the parsing task as an
image object detection problem: Given document
images, the models predict rectangular bound-
ing boxes, segmenting the page into individual
components of different categories. These models
excel at capturing complex visual layout structures
like figures or tables, but because they operate only
on visual signals without textual information, they
cannot accurately predict fine-grained semantic
categories like title, author, or abstract, which
are of central importance for parsing scientific
documents.

1https://github.com/kermitt2/pdfalto (last accessed Jan. 1, 2022).

Figure 1: (a) Real-world scientific documents often
have intricate layout structures, so analyzing only flat-
tened raw text forfeits valuable information, yielding
sub-optimal results. (b) The complex structures can be
broken down into groups (text blocks or lines) that are
composed of tokens with the same semantic category.

structures to define a hierarchical model that mod-
els pages as collections of groups rather than of
individual tokens, increasing inference efficiency.
Previous datasets for evaluating PDF content
extraction rely on machine-generated labels of
imperfect quality, and comprise papers from a
limited range of scientific disciplines. To bet-
ter evaluate our proposed methods, we design a
new benchmark suite, Semantic Scholar Visual
Layout-enhanced Scientific Text Understanding
Evaluation (S2-VLUE). The benchmark extends
two existing resources (Tkaczyk et al., 2015;
Li et al., 2020) and introduces a newly curated
dataset, S2-VL, which contains high-quality human
annotations for papers across 19 disciplines.

Our contributions are as follows:

1. We introduce a new strategy for PDF con-
tent extraction that uses VILA structures to
inject layout information into language models,
and show that this improves accuracy
without the expensive pretraining required by
existing methods, and generalizes to different
language models.

2. We design two models that incorporate VILA
features differently. The I-VILA model in-
jects layout indicator tokens into the input
texts and improves prediction accuracy (up
to +1.9% Macro F1) and consistency compared
with the previous layout-augmented
language model LayoutLM (Xu et al., 2020).


2.2 Layout-aware Language Models

Recent methods on layout-aware language models
improve prediction accuracy by jointly modeling
documents’ textual and visual signals. LayoutLM
(Xu et al., 2020) learns a set of novel positional
embeddings that can encode tokens’ 2D spatial
location on the page and improves accuracy on
scientific document parsing (Li et al., 2020). More recent work (Xu
et al., 2021; Li et al., 2021) aims to encode the document in a
multimodal fashion by modeling text and images together. However, these
existing joint-approach models require expensive pretraining, and may
be less efficient as a consequence of their joint inputs (Xu et al.,
2021), making them less suitable for deployment at scale. In this work,
we aim to incorporate document layout features in the form of visual
layout groupings, in novel ways that improve or match performance
without the need for expensive pretraining. Our work is well-aligned
with recent efforts for incorporating structural information into
language models (Lee et al., 2020; Bai et al., 2021; Yang et al., 2020;
Zhang et al., 2019).

2.3 Training and Evaluation Datasets

The available training and evaluation datasets
for scientific content extraction models are auto-
matically generated from author-provided source
data—for example, GROTOAP2 (Tkaczyk et al.,
2014) and PubLayNet (Zhong et al., 2019) are constructed from PubMed
Central XML and DocBank (Li et al., 2020) from arXiv LaTeX source.
Despite their large sample sizes, these datasets have limited layout
variation, leading to poor generalization to papers from other
disciplines with distinct layouts. Also, due to the heuristic way in
which the data are automatically labeled, they contain systematic
classification errors that can affect downstream modeling performance.
We elaborate on the limitations of GROTOAP2 (Tkaczyk et al., 2014) and
DocBank (Li et al., 2020) in Section 4. PubLayNet (Zhong et al., 2019)
provides high-quality text block annotations on 330k
document pages, but its annotations only cover
five distinct categories. Livathinos et al. (2021)
and Staar et al. (2018) curated a multi-disciplinary,
manually annotated dataset of 2,940 paper pages,
but only make available the processed page fea-
tures without the raw text or source PDFs needed
for experiments with layout-aware methods. We
introduce a new evaluation dataset, S2-VL, to
address limitations in these existing datasets.

3 Methods

3.1 Problem Formulation

Following prior work (Tkaczyk et al., 2015; Li et al., 2020), our task
is to map each token $t_i$ in an input sequence $T = (t_1, \ldots, t_n)$
to its semantic category $c_i$ (title, body text, reference, etc.).
Input tokens are extracted via PDF-to-text tools, which output both the
word $t_i$ and its 2D position on the page, a rectangular bounding box
$a_i = (x_0, y_0, x_1, y_1)$ denoting the left, top, right, and bottom
coordinates of the word. The order of tokens in sequence $T$ may not
reflect the actual reading order of the text due to errors in
PDF-to-text conversion (e.g., in the original DocBank dataset [Li
et al., 2020]), which poses an additional challenge to language models
pre-trained on regular texts.

Besides the token sequence T , additional vi-
sual structures G can also be retrieved from the
source document. Scientific papers are organized
into groups of tokens (lines or blocks), which
consist of consecutive pieces of text that can
be segmented from other pieces based on spa-
tial gaps. The group information can be extracted
via visual layout detection models (Zhong et al.,
2019; He et al., 2017) or rule-based PDF parsing
(Tkaczyk et al., 2015).

Formally, given an input page, the group detector identifies a series
of $m$ rectangular boxes, one for each group,
$b_j \in B = \{b_1, \ldots, b_m\}$ in the input document page, where
$b_j = (x_0, y_0, x_1, y_1)$ denotes the box coordinates. Page tokens
are allocated to the visual groups $g_j = (b_j, T^{(j)})$, where
$T^{(j)} = \{t_i \mid a_i \sqsubset b_j,\ t_i \in T\}$ contains all
tokens in the $j$-th group, and $a_i \sqsubset b_j$ denotes that the
center point of token $t_i$’s bounding box $a_i$ is strictly within the
group box $b_j$. When two group regions overlap and share common
tokens, the system assigns the common tokens to the earlier group by
the estimated reading order from the PDF parser. We refer to text block
groups of a page as $G^{(B)}$ and text line groups as $G^{(L)}$. In our
case, we define text lines as consecutive tokens appearing at nearly
the same vertical position.2 Text blocks are sets of adjacent text
lines with gaps smaller than a certain threshold, and ideally the same
semantic category. That is, even two close lines of different semantic
2Or horizontal position, when the text is written vertically.


of modeling, rather than only providing positional
information at the initial embedding layers as in
LayoutLM (Xu et al., 2020). We empirically show
that BERT-based models can learn to leverage
such special tokens to improve both the accuracy
and the consistency of category predictions, even
without an additional loss penalizing inconsistent
intra-group predictions.

In practice, given $G$, we linearize tokens $T^{(j)}$ from each group
and flatten them into a 1D sequence. To avoid capturing confounding
information in existing pretraining tasks, we insert a new token
previously unseen by the model, [BLK], in between text from different
groups $T^{(j)}$. The resulting input sequence is of the form
$[\text{[CLS]}, t_1^{(1)}, \ldots, t_{n_j}^{(j)}, \text{[BLK]}, t_1^{(j+1)}, \ldots, t_{n_m}^{(m)}, \text{[SEP]}]$,
where $t_i^{(j)}$ and $n_j$ indicate the $i$-th token and the total
number of tokens, respectively, in the $j$-th group, and [CLS] and
[SEP] are the special tokens used by the BERT model and are inserted to
preserve a similar input structure.3 The BERT-based models are
fine-tuned on the token classification objective with a cross entropy
loss. When I-VILA uses a visual pretrained language model as input,
such as LayoutLM (Xu et al., 2020), the positional embeddings for the
newly injected [BLK] tokens are generated from the corresponding
group’s bounding box $b_j$.
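The linearization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_ivila_input` is hypothetical, tokens are plain strings rather than subword ids, and giving the special tokens the label -100 (the default `ignore_index` of PyTorch's cross entropy loss) is an assumption about how they might be excluded from the loss.

```python
from typing import List, Tuple

BLK, CLS, SEP = "[BLK]", "[CLS]", "[SEP]"
IGNORE = -100  # assumed label for special tokens, skipped by the loss


def build_ivila_input(
    groups: List[List[str]], labels: List[List[int]]
) -> Tuple[List[str], List[int]]:
    """Flatten grouped tokens into one sequence with [BLK] between groups."""
    tokens: List[str] = [CLS]
    token_labels: List[int] = [IGNORE]
    for j, (group_tokens, group_labels) in enumerate(zip(groups, labels)):
        if j > 0:
            # Layout indicator marking the boundary between two groups.
            tokens.append(BLK)
            token_labels.append(IGNORE)
        tokens.extend(group_tokens)
        token_labels.extend(group_labels)
    tokens.append(SEP)
    token_labels.append(IGNORE)
    return tokens, token_labels


# Two text blocks: a title block (category 0) and an author block (category 1).
example_groups = [["VILA", ":", "Improving"], ["Zejiang", "Shen"]]
example_labels = [[0, 0, 0], [1, 1]]
flat_tokens, flat_labels = build_ivila_input(example_groups, example_labels)
print(flat_tokens)
print(flat_labels)
```

In a pipeline built on the transformers library, the new token would also have to be registered with the tokenizer and the embedding matrix resized before fine-tuning; that step is omitted here.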

3.3 H-VILA: Visual Layout-guided Hierarchical Model

The uniformity of group token categories also
suggests the possibility of building a group-level
classifier. Inspired by recent advances in modeling long documents,
hierarchical structures (Yang et al., 2020; Zhang et al., 2019) provide
an ideal architecture for the end task while optimizing for
computational cost. As illustrated in Figure 3, our hierarchical
approach uses two transformer-based models, one to encode each group in
terms of its words, and another modeling the whole document in terms of
the groups. We provide the details below.

The Group Encoder is an $l_g$-layer transformer that converts each
group $g_j$ into a hidden vector $h_j$. Following the typical
transformer model setting (Vaswani et al., 2017), the model takes a
sequence of tokens $T^{(j)}$ within a group as input,

3The [CLS] and [SEP] tokens are only inserted at the
beginning or end of each input sequence, and they do not
represent the sentence boundaries in this case.


Figure 2: Comparing inserting indicator tokens [BLK]
based on VILA groups and sentence boundaries. Indicators
representing VILA groups (e.g., text blocks
in the left figure) are usually consistent with the to-
ken category changes (illustrated by the background
color in (a)), while sentence boundary indicators fail
to provide helpful hints (both ‘‘false positives’’ and
‘‘false negatives’’ occur frequently in (b)). Best viewed
in color.

categories should be allocated to separate blocks,
and in our models we use a block detector trained
toward this objective. In practice, block or line
detectors may generate incorrect predictions.
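The token-to-group allocation rule from Section 3.1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the tuple-based box format are assumptions, and the group boxes are assumed to be already sorted by estimated reading order so that a token shared by overlapping groups goes to the earlier one.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1): left, top, right, bottom


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def contains_strictly(group_box: Box, point: Tuple[float, float]) -> bool:
    """True if the point lies strictly inside the group box."""
    x0, y0, x1, y1 = group_box
    px, py = point
    return x0 < px < x1 and y0 < py < y1


def assign_tokens_to_groups(
    token_boxes: List[Box], group_boxes: List[Box]
) -> List[List[int]]:
    """Return, for each group, the indices of the tokens allocated to it.

    Assumes group_boxes are in reading order; a token whose center falls
    inside several overlapping groups goes to the earliest one. Tokens
    whose center falls in no group stay unassigned.
    """
    groups: List[List[int]] = [[] for _ in group_boxes]
    for i, token_box in enumerate(token_boxes):
        point = center(token_box)
        for j, group_box in enumerate(group_boxes):
            if contains_strictly(group_box, point):
                groups[j].append(i)
                break  # the earlier group wins
    return groups


tokens = [(10, 10, 30, 20), (35, 10, 60, 20), (10, 50, 40, 60)]
blocks = [(0, 0, 100, 30), (0, 25, 100, 70)]  # the two blocks overlap slightly
print(assign_tokens_to_groups(tokens, blocks))
```

Because detectors can be wrong, a production version would also need a policy for tokens that land in no group; this sketch simply leaves them out.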

In the following sections, we describe our two
models, I-VILA and H-VILA. The models take
a BERT-based pretrained language model as a
foundation, which may or may not itself be
layout-aware (we experiment with DistilBERT,
BERT, RoBERTa, and LayoutLM in our experiments).
Our models then augment the base model
to incorporate group structures, as detailed below.

3.2 I-VILA: Injecting Visual Layout Indicators

According to the group uniformity assumption,
token categories are homogeneous within a group,
and categorical changes should happen at group
boundaries. This suggests that layout information
should be incorporated in a way that informs
token category consistency intra-group and signals
possible token category changes inter-group.

Our first method supplies VILA structures by
inserting a special layout indicator token at each
group boundary in the input text, and models this
with a pretrained language model (which may or
may not be position-aware). We refer to this as
the I-VILA method. As shown in Figure 2(a),
the inserted tokens partition the text into seg-
ments that provide helpful structure to the model,
hinting at possible category changes. In I-VILA,
the special tokens are seen at all layers of the
model, providing VILA signals at different stages


Figure 3: Illustration of the H-VILA model. Texts from each visual layout group are encoded separately using
the group encoder, and the generated representations are subsequently modeled by a page encoder. The semantic
categories are predicted at the group level, which significantly improves efficiency.

and maps each token $t_i^{(j)}$ into a dense vector $e_i^{(j)}$ of
dimension $d$. Subsequently, a group vector aggregation function
$f : \mathbb{R}^{n_j \times d} \to \mathbb{R}^{d}$ is applied that
projects the token representations
$\left(e_1^{(j)}, \ldots, e_{n_j}^{(j)}\right)$ to a single vector
$\tilde{h}_j$ that represents the group’s textual information. A
group’s 2D spatial information is incorporated in the form of
positional embeddings, and the final group representation $h_j$ can be
calculated as:

$$h_j = \tilde{h}_j + p(b_j), \tag{1}$$

where $p$ is the 2D positional embedding similar to the one used in
LayoutLM:

$$p(b) = E_x(x_0) + E_x(x_1) + E_w(x_1 - x_0) + E_y(y_0) + E_y(y_1) + E_h(y_1 - y_0), \tag{2}$$

where $E_x$, $E_y$, $E_w$, $E_h$ are the embedding matrices for $x$,
$y$ coordinates and width and height. In practice, we find that
injecting positional information using the bounding box of the first
token within the group leads to better results, and we choose group
vector aggregation function $f$ to be the average over all token
representations.
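Equations (1) and (2) can be made concrete with a small pure-Python sketch. It is illustrative only: the class and function names are hypothetical, the embedding tables are randomly initialized stand-ins for learned matrices, and box coordinates are assumed to be integers in a fixed range so they can index the tables directly.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), integer page coordinates


def make_table(num_entries: int, dim: int, rng: random.Random) -> List[Vector]:
    """Random embedding table standing in for a learned matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_entries)]


def vector_sum(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def vector_mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average aggregation f over the token representations of one group."""
    count = len(vectors)
    return [total / count for total in vector_sum(vectors)]


class BoxEmbedding:
    """p(b) from Eq. (2): six table lookups summed together."""

    def __init__(self, max_coord: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.E_x = make_table(max_coord, dim, rng)
        self.E_y = make_table(max_coord, dim, rng)
        self.E_w = make_table(max_coord, dim, rng)
        self.E_h = make_table(max_coord, dim, rng)

    def __call__(self, box: Box) -> Vector:
        x0, y0, x1, y1 = box
        return vector_sum([
            self.E_x[x0], self.E_x[x1], self.E_w[x1 - x0],
            self.E_y[y0], self.E_y[y1], self.E_h[y1 - y0],
        ])


def group_representation(
    token_vectors: Sequence[Sequence[float]], box: Box, p: BoxEmbedding
) -> Vector:
    """h_j from Eq. (1): averaged token vectors plus the box embedding."""
    return vector_sum([vector_mean(token_vectors), p(box)])


p = BoxEmbedding(max_coord=16, dim=4)
h = group_representation([[1.0] * 4, [3.0] * 4], (1, 2, 5, 6), p)
print(h)
```

Following the remark above, the box passed in could be that of the first token in the group rather than the group box; the sketch is agnostic to which one the caller supplies.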

The Page Encoder is another stacked transformer model of $l_p$ layers
that operates on the group representation $h_j$ generated by the group
encoder. It generates a final group representation $s_j$ for downstream
classification. An MLP-based linear classifier is attached thereafter,
and is trained to generate the group-level category probability
$p_{jc}$.

Different from previous work (Yang et al., 2020), we restrict the
choice of $l_g$ and $l_p$ to {1, 12} such that we can load pre-trained
weights from BERT base models. Therefore, no additional pretraining is
required, and the H-VILA model can be fine-tuned directly for the
downstream classification task. Specifically, we set $l_g = 1$ and
initialize the group encoder from the first-layer transformer weights
of BERT. The page encoder is configured as either a one-layer
transformer or a 12-layer transformer that resembles a full LayoutLM
model. Weights are initialized from the first-layer or full 12 layers
of the LayoutLM model, which is trained to model texts in conjunction
with their positions.

Group Token Truncation As suggested in Yang et al.’s (2020) work, when
an input document of length $N$ is evenly split into segments of
$L_s$, the memory footprint of the hierarchical model is
$O(l_g N L_s + l_p (N / L_s)^2)$, and for long documents with
$N \gg L_s$, it approximates as $O(N^2 / L_s^2)$. However, in our case,
it is infeasible to adopt the Greedy Sentence Filling technique (Yang
et al., 2020) as it mingles signals from different groups and
obfuscates group structures. It is also less desirable to simply use
the maximum token count per group $\max_{1 \le j \le m} n_j$ to batch
the contents due to the high variance of group token length (see
Table 1). Instead, we choose a group token truncation count $\tilde{n}$
empirically based on statistics of the group token length distribution
such that $N \approx \tilde{n} m$, and use the first $\tilde{n}$ tokens
to aggregate the group hidden vector $h_j$ for all groups (we pad the
sequence to $\tilde{n}$ when it is shorter).
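A short sketch of the truncation step and of the cost expression above may help. It is an illustration under assumptions: `truncate_and_pad` and `hierarchical_cost` are hypothetical names, the pad token string is a placeholder, and the cost function evaluates the big-O expression literally with all constants dropped, so only relative magnitudes are meaningful.

```python
from typing import List

PAD = "[PAD]"  # placeholder pad token for this sketch


def truncate_and_pad(
    groups: List[List[str]], n_trunc: int, pad_token: str = PAD
) -> List[List[str]]:
    """Keep the first n_trunc tokens of every group and pad shorter groups."""
    batch: List[List[str]] = []
    for group in groups:
        kept = group[:n_trunc]
        kept = kept + [pad_token] * (n_trunc - len(kept))
        batch.append(kept)
    return batch


def hierarchical_cost(
    num_tokens: int, segment_len: int, l_g: int = 1, l_p: int = 12
) -> int:
    """Evaluate l_g * N * L_s + l_p * (N / L_s)^2 with constants dropped."""
    num_segments = -(-num_tokens // segment_len)  # ceiling division
    return l_g * num_tokens * segment_len + l_p * num_segments ** 2


page_groups = [["Deep", "learning", "for", "PDFs"], ["Abstract"], ["We", "propose"]]
print(truncate_and_pad(page_groups, n_trunc=3))
print(hierarchical_cost(num_tokens=1200, segment_len=50))
```

As a rough plausibility check of $N \approx \tilde{n} m$ using the GROTOAP2 averages in Table 1 (1203 tokens, 90 text lines, 12 text blocks per page), $N/m$ is about 13 tokens per line and about 100 tokens per block. These are back-of-the-envelope figures, not the truncation counts used in the experiments.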

4 Benchmark Suite

To systematically evaluate the proposed methods,
we develop the Semantic Scholar Visual

                            GROTOAP2          DocBank               S2-VL
Train / Dev / Test Pages    83k / 18k / 18k   398k / 50k / 50k      1.3k [1]
Annotation Method           Automatic         Automatic             Human Annotation
Scientific Discipline       Life Science      Math / Physics / CS   19 Disciplines
Visual Layout Group         PDF parsing       Vision model          Gold Label / Detection methods
Number of Categories        22                12                    15
Average Token Count [2]     1203 (591)        838 (503)             790 (453)
Average Text Line Count     90 (51)           60 (34)               64 (54)
Average Text Block Count    12 (16)           15 (8)                22 (36)

[1] This is the total number of pages in the S2-VL dataset; we use 5-fold cross-validation for training and testing.
[2] We report the average token, text line, and text block count per page, with standard deviations in parentheses.

Table 1: Details for the three datasets in the S2-VLUE benchmark.

Layout-enhanced Scientific Text Understanding
Evaluation (S2-VLUE) benchmark suite. S2-
VLUE consists of three datasets—two previ-
ously released resources that we augment with
VILA information, and a new hand-curated dataset
S2-VL.

Key statistics for S2-VLUE are provided in
Table 1. Notably, the three constituent datasets
differ with respect to their: 1) annotation method,
2) VILA generation method, and 3) paper domain
coverage. We provide details below.

GROTOAP2 The GROTOAP2 dataset (Tkaczyk
et al., 2014) is automatically annotated. Its text
block and line groupings come from the CER-
MINE PDF parsing tool (Tkaczyk et al., 2015);
text block category labels are then obtained by
pairing block texts with structured data from doc-
ument source files obtained from PubMed Central.
A small subset of data is inspected by experts, and
a set of post-processing heuristics is developed to
further improve annotation quality. Because to-
ken categories are annotated by group, the dataset
achieves perfect accordance between token labels
and VILA structures. However, the method of
rule-based PDF parsing employed by the authors
introduces labeling inaccuracies due to imperfect
VILA detection: the authors find that block-level
annotation accuracy achieves only 92 Macro F1
in a small gold evaluation set. Additionally, all
samples are extracted from the PMC Open Access
Subset4 that includes only life sciences publications;
these papers have less representation of
classification types like ‘‘equation’’, which are
common in other scientific disciplines.

4https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (last accessed Jan. 1, 2022).

DocBank The DocBank dataset (Li et al., 2020) is fully machine-labeled
without any postprocessing heuristics or human assessment. The authors
first identify token categories by automatically parsing the source TeX
files available from arXiv. Text block annotations are then generated
by grouping together tokens of the same category using connected
component analysis. However, only a specific set of token tags is
extracted from the main TeX file for each paper, leading to inaccurate
and incomplete token labels, especially for papers employing LaTeX
macro commands,5 and thus, incorrect visual groupings. Hence, we
develop a Mask R-CNN-based vision layout detection model based on a
collection of existing resources (Zhong et al., 2019; MFD, 2021; He
et al., 2017; Shen et al., 2021) to fix these inaccuracies and generate
trustworthy VILA annotations at both the text block and line level.6 As
a result, this dataset can be used to evaluate VILA models under a
different setting, since the VILA structures are generated
independently from the token annotations. Because the papers in DocBank
are from arXiv, however, they primarily represent domains like Computer
Science, Physics, and Mathematics, limiting the amount of layout
variation.

5For example, in DocBank, ‘‘Figure 1’’ in a figure caption
block is usually labeled as ‘‘paragraph’’ rather than
‘‘caption’’. DocBank labels all tokens that are not explicitly
contained in the set of processed LaTeX tags as ‘‘paragraph.’’
6The original generation method for DocBank requires
rendering LaTeX source, which results in layouts different
from the publicly available versions of these documents on
arXiv. However, because the authors of the dataset only
provide document page images, rather than the rendered
PDF, we can only use image-based approaches for layout
detection. We refer readers to the appendix for details.


S2-VL We introduce a new dataset to address the three major drawbacks
in existing work: 1) annotation quality, 2) VILA fidelity, and 3)
domain coverage. S2-VL is manually labeled by graduate students who
frequently read scientific papers. Using the PAWLS annotation tool
(Neumann et al., 2021), annotators draw rectangular text blocks
directly on each PDF page, and specify the block-level semantic
categories from 15 possible candidates.7 Tokens within a group can
therefore inherit the category from the parent text block.
Inter-annotator agreement, in terms of token-level accuracy measured on
a 12-paper subset, is high at 0.95. The ground-truth VILA labels in
S2-VL can be used to fine-tune visual layout detection models, and
paper PDFs are also included, making PDF-based structure parsing
feasible; this enables VILA annotations to be created by different
means, which is helpful for benchmarking new VILA-based models.
Additionally, S2-VL currently contains 1337 pages of 87 papers from 19
different disciplines, including, for example, Philosophy and
Sociology, which are not present in previous data sets.

Overall, the datasets in S2-VLUE cover a wide
range of academic disciplines with different layouts.
The VILA structures in the three component
datasets are curated differently, which helps to
evaluate the generality of VILA-based methods.

5 Experimental Setup

5.1 Implementation Details

Our models are implemented using PyTorch (Paszke et al., 2019) and the
transformers library (Wolf et al., 2020). A series of baseline and VILA
models are fine-tuned on 4-GPU RTX8000 or A100 machines. The AdamW
optimizer (Kingma and Ba, 2015; Loshchilov and Hutter, 2019) is adopted
with a $5 \times 10^{-5}$ learning rate and
$(\beta_1, \beta_2) = (0.9, 0.999)$. The learning rate is linearly
warmed up over the first 5% of steps and then linearly decayed. For all
datasets (GROTOAP2, DocBank, S2-VL), unless otherwise specified, we
select the best fine-tuning batch size (40, 40, and 12) and

7Of our defined categories, 12 are common fields and
taken directly from other similar datasets (e.g., title, abstract).
We add three categories: equation, header, and footer, which
commonly occur in scientific papers and are included in
full text mining resources like S2ORC (Lo et al., 2020) and
CORD-19 (Wang et al., 2020).

training epochs (24, 6 (see footnote 8), and 10) for all models. As for
S2-VL, given its smaller size, we use 5-fold cross validation and
report averaged scores, and use a $2 \times 10^{-5}$ learning rate with
20 epochs. We split S2-VL based on papers rather than pages to avoid
exposing paper templates of test samples in the training data. Mixed
precision training (Micikevicius et al., 2018) is used to speed up the
training process.
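The learning-rate schedule described in this section (linear warm-up over the first 5% of steps, then linear decay) can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, the decay is assumed to reach zero at the final step, and the default values simply mirror the numbers quoted above.

```python
def linear_warmup_decay(
    step: int,
    total_steps: int,
    base_lr: float = 5e-5,
    warmup_frac: float = 0.05,
) -> float:
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warm-up to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 25, 50, 525, 1000):
    print(s, linear_warmup_decay(s, total_steps=1000))
```

The transformers library ships a scheduler with the same shape, `get_linear_schedule_with_warmup`, which is the more likely choice in practice; the plain function here only makes the arithmetic explicit.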

For I-VILA models, we fine-tune several
BERT variants with VILA-enhanced text inputs,
and the models are initialized from pre-trained
weights available in the transformers library. The
H-VILA models are initialized as mentioned in
Section 3.3, and, by default, positional information
is injected for each group.

5.2 Competing Methods

We consider three approaches that compete with
the proposed methods from different perspectives:

1. Baselines The LayoutLM (Xu et al., 2020) model is the main
baseline method. It is the closest model counterpart to our
VILA-augmented models as it also injects layout information and
achieves previous SOTA performance on the scientific PDF parsing
task (Li et al., 2020).

2. Sentence Breaks For I-VILA models, besides using VILA-based
indicators, we also compare with indicators generated from sentence
breaks detected by PySBD (Sadvilkar and Neumann, 2020). Figure 2(b)
shows that the inserted sentence-break indicators may have both
‘‘false-positive’’ and ‘‘false-negative’’ hints for token semantic
category changes, making them less helpful for the end task.

3. Simple Group Classifier For hierarchical models, we consider
another baseline approach, where the group texts are separately fed
into a LayoutLM-based group classifier. It doesn’t require
complicated model design, and uses a full LayoutLM to model each
group’s text, as opposed to the $l_g = 1$ layer used in the H-VILA
models. However,

8We try to keep gradient update steps the same for
the GROTOAP2 and the DocBank dataset. As DocBank
contains 4× as many examples, the number of DocBank models’
training epochs is reduced by 75%.


                                      GROTOAP2             DocBank              S2-VL [1]
                                      Macro F1 ↑  H(G) ↓   Macro F1 ↑  H(G) ↓   Macro F1 ↑        H(G) ↓
LayoutLM_BASE (Xu et al., 2020)       92.34       0.78     91.06       2.64     82.69 (6.04)      4.19 (0.25)
LayoutLM_BASE + Sentence Breaks       91.83       0.78     91.44       2.62     82.81 (5.21)      4.21 (0.55)
LayoutLM_BASE + I-VILA (Text Line)    92.37       0.73     92.79       2.17     83.77 (5.75) [2]  3.28 (0.35)
LayoutLM_BASE + I-VILA (Text Block)   93.38       0.53     92.00       2.10     83.44 (6.48)      2.83 (0.34)

[1] For S2-VL, we show averaged scores with standard deviation in parentheses across the 5-fold cross validation subsets.
[2] In this table, we report S2-VL results using VILA structures detected by visual layout models. When the ground-truth
VILA structures are available, both I-VILA and H-VILA models can achieve better accuracy, as shown in Table 6.

Table 2: Performance of baseline and I-VILA models on the scientific document extraction task. I-VILA provides
consistent accuracy improvements over the baseline LayoutLM model on all three benchmark datasets.

this method cannot account for inter-group
interactions, and is far less efficient.9
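
As an illustration of the indicator-based inputs compared in this section, the following minimal sketch flattens a page into one token sequence and places an indicator token between consecutive groups. The same routine covers both settings: the groups can be text lines or blocks (I-VILA) or sentences produced by a segmenter such as PySBD (the Sentence Breaks baseline). The function name and the "[BLK]" token string are assumptions made for this example, not names taken from the released code.

```python
from typing import List

# Hypothetical indicator token; the actual special token is a modeling choice.
BOUNDARY_TOKEN = "[BLK]"


def insert_group_indicators(groups: List[List[str]],
                            indicator: str = BOUNDARY_TOKEN) -> List[str]:
    """Flatten token groups, inserting an indicator between consecutive groups."""
    flat: List[str] = []
    for group in groups:
        if not group:
            continue  # skip empty groups so no doubled indicators appear
        if flat:
            flat.append(indicator)  # boundary between previous and current group
        flat.extend(group)
    return flat


if __name__ == "__main__":
    page = [["Deep", "learning"], ["1", "Introduction"]]
    print(insert_group_indicators(page))
```

Only the grouping changes between the two settings; the downstream token classifier is the same.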

5.3 Metrics

Prediction Accuracy The token label distribution is heavily skewed
towards categories corresponding to paper body texts (e.g., the ‘‘BODY
CONTENT’’ category in GROTOAP2 or the ‘‘paragraph’’ category in S2-VL and
DocBank). Therefore, we choose to use Macro F1 as our primary evaluation
metric for prediction accuracy.

Group Category Inconsistency To better characterize how different models
behave with respect to group structure, we also report a diagnostic
metric that calculates the uniformity of the token categories within a
group. Hypothetically, tokens t^(j) in the j-th group g_j share the same
category c, and naturally the group inherits the semantic label c. We use
the group token category entropy to measure the inconsistency of a
model’s predicted token categories within the same group:

$$H(g) = -\sum_{c} p_c \log p_c, \tag{3}$$

where p_c denotes the probability of a token in group g being classified
as category c. When all tokens in a group have the same category, the
group token category inconsistency is zero. H(g) reaches the maximum when
p_c is a uniform distribution across all possible categories. The
inconsistency for G is the arithmetic mean of all individual groups g_i:

$$H(G) = \frac{1}{m} \sum_{i=1}^{m} H(g_i). \tag{4}$$

H(G) acts as an auxiliary metric for evaluating prediction quality with
respect to the provided VILA structures. In the remainder of this paper,
we report the inconsistency metric for text blocks G(B) by default, and
scale the values by a factor of 100.

Measuring Efficiency We report the inference time per sample as a measure
of model efficiency. We select 1,000 pages from the GROTOAP2 test set,
and report the average model runtime for 3 runs on this subset. All
models are tested on an isolated machine with a single V100 GPU. We
report the time incurred for text classification; time costs associated
with PDF-to-text conversion or VILA structure detection are not included
(these are treated as pre-processing steps, which can be cached and
re-used when processing documents with different content extractors).

9Despite the group texts being relatively short, this method
causes extra computational overhead as the full LayoutLM
model needs to be run m times for all groups in a page.
The simple group classifier models are only trained for 5,
2, and 5 epochs for GROTOAP2, DocBank, and S2-VL for
tractability.

6 Results

6.1 I-VILA Achieves Better Accuracy

Table 2 shows that I-VILA models lead to consistent accuracy improvements
without further pretraining. Compared to the baseline LayoutLM model,
inserting layout indicators results in +1.13%, +1.90%, and +1.29% Macro
F1 improvements across the three benchmark datasets. I-VILA models also
achieve better token prediction consistency; the corresponding group
category inconsistency is reduced by 32.1%, 21.7%, and 21.7% compared to
baseline. Moreover, VILA information is also more helpful

Figure 4: Model predictions for the 10th page of our paper draft. We present the token category and text block
bounding boxes (highlighted in red rectangles) based on the (a) ground-truth annotations and model predictions
from both I-VILA and H-VILA models (the three results happen to be identical) and (b) model predictions from
the LayoutLM model. When VILA is injected, the model achieves more consistent predictions for the example,
as indicated by arrows (1) and (2) in the figure. Best viewed in color.

Model                      GROTOAP2              DocBank               S2-VL                         Inference Time (ms)
                           Macro F1 ↑  H(G) ↓    Macro F1 ↑  H(G) ↓    Macro F1 ↑     H(G) ↓
LayoutLMBASE               92.34       0.78      91.06       2.64      82.69(6.04)    4.19(0.25)     52.56(0.25)
Simple Group Classifier    92.65       0.00      87.01       0.00      –¹             –              82.57(0.30)
H-VILA(Text Line)          91.65       0.32      91.27       1.07      83.69(2.92)    1.70(0.68)     28.07(0.37)²
H-VILA(Text Block)         92.37       0.00      87.78       0.00      82.09(5.89)    0.36(0.12)     16.37(0.15)

¹ The simple group classifier fails to converge for one run. We do not report the results for fair comparison.
² When reporting efficiency in other parts of the paper, we use this result because of its optimal combination of accuracy and efficiency.

Table 3: Content extraction performance for H-VILA. The H-VILA models significantly reduce the inference
time cost compared to LayoutLM, while achieving comparable accuracy on the three benchmark datasets.

than language structures: I-VILA models based on text blocks and lines
all outperform the sentence boundary-based method by a similar margin.
Figure 4 shows an example of the VILA model predictions.
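
The group category inconsistency reported alongside Macro F1 (Equations 3 and 4) can be computed from predicted token labels as in the sketch below. It is a minimal illustration rather than the evaluation code used for the tables: the function names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and skipping empty groups is a choice made here for robustness.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def group_entropy(labels: Sequence[Hashable]) -> float:
    """H(g) = -sum_c p_c log p_c over the predicted categories in one group."""
    n = len(labels)
    counts = Counter(labels)
    # p_c is estimated as the fraction of tokens in the group predicted as c.
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def group_category_inconsistency(groups: List[Sequence[Hashable]],
                                 scale: float = 100.0) -> float:
    """H(G): mean of H(g_i) over all non-empty groups, scaled by `scale`."""
    non_empty = [g for g in groups if len(g) > 0]
    if not non_empty:
        return 0.0
    return scale * sum(group_entropy(g) for g in non_empty) / len(non_empty)


if __name__ == "__main__":
    predictions = [
        ["paragraph", "paragraph", "paragraph"],  # uniform group, entropy 0
        ["caption", "paragraph"],                 # evenly split group
    ]
    print(group_category_inconsistency(predictions))
```

A model that predicts one category per group, as the block-level classifiers do, yields zero under this measure.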

6.2 H-VILA is More Efficient

Table 3 summarizes the efficiency improvements of the H-VILA models with
lg = 1 and lp = 12. As block-level models perform predictions directly at
the text block level, the group category inconsistency is naturally zero.
Compared to LayoutLM, H-VILA models with text lines bring a 46.59%
reduction in inference time, without heavily penalizing the final
prediction accuracies (−0.75%, +0.23%, +1.21% Macro F1). When text blocks
are used, H-VILA models are even more efficient (68.85% and 80.17%
inference time reduction compared to the LayoutLM and simple group
classifier baselines), and they also achieve similar or better accuracy
compared to the simple group classifier (−0.30%, +0.88% Macro F1 for
GROTOAP2 and DocBank).
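
The percentages in this paragraph appear to be relative changes with respect to the baseline rather than absolute differences. The short sketch below recomputes them from the Table 3 entries; the helper name is hypothetical and the numbers are copied from the table.

```python
def relative_change(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base


# Inference time in ms, from Table 3.
LAYOUTLM_MS, SIMPLE_GROUP_MS = 52.56, 82.57
HVILA_LINE_MS, HVILA_BLOCK_MS = 28.07, 16.37

# A reduction is the negated relative change.
line_vs_layoutlm = -relative_change(HVILA_LINE_MS, LAYOUTLM_MS)
block_vs_layoutlm = -relative_change(HVILA_BLOCK_MS, LAYOUTLM_MS)
block_vs_simple = -relative_change(HVILA_BLOCK_MS, SIMPLE_GROUP_MS)

# Macro F1 on GROTOAP2: H-VILA(Text Line) against LayoutLM.
f1_line_vs_layoutlm = relative_change(91.65, 92.34)

if __name__ == "__main__":
    print(f"{line_vs_layoutlm:.2f}% faster (text lines vs LayoutLM)")
    print(f"{block_vs_layoutlm:.2f}% faster (text blocks vs LayoutLM)")
    print(f"{block_vs_simple:.2f}% faster (text blocks vs simple group classifier)")
    print(f"{f1_line_vs_layoutlm:+.2f}% Macro F1 (text lines vs LayoutLM)")
```

Read this way, the −0.75% Macro F1 figure corresponds to an absolute difference of 0.69 points on GROTOAP2.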

However, in H-VILA models, the inductive bias from the group uniformity
assumption also has a drawback: models are often less accurate than their
I-VILA counterparts, and performing block-level classification may
sometimes lead to worse results (−3.60% and −0.73% Macro F1 on the
DocBank and S2-VL datasets compared to LayoutLM). Moreover, as shown in
Figure 5, when the injected layout group is incorrect, the H-VILA method
lacks the flexibility to assign different token categories within a
group, leading to lower accuracy. Additional analysis of the impact of
the layout group predictions is detailed in Section 8.

Figure 5: Illustration of models trained and evaluated with incorrect
text block detections (only the top half of the page is shown). The
blocks are created by vision predictions, which fail to capture the
correct caption text structure (arrow 1). Because the I-VILA model can
generate different token predictions within a group, it maintains high
accuracy, whereas H-VILA assigns the same category to all tokens in the
incorrect block, leading to lower accuracy.

7 Ablation Studies

7.1 I-VILA is Effective Across BERT Variants

To test the applicability of the VILA methods, we adapt I-VILA to
different BERT variants and train them on the GROTOAP2 dataset. As shown
in Table 4, I-VILA leads to consistent improvements on DistilBERT (Sanh
et al., 2019), BERT, and RoBERTa (Liu et al., 2019),10 with gains of up
to +1.77%, +1.69%, and +0.96% Macro F1 compared to non-VILA counterparts.

Base Model    Baseline    Text Line G(L)    Text Block G(B)
DistilBERT    90.52       91.14             92.12
BERT          90.78       91.65             92.31
RoBERTa       91.64       92.04             92.52
LayoutLM      92.34       92.37             93.38

Table 4: Content extraction performance (Macro F1 on the GROTOAP2
dataset) for I-VILA using different BERT model variants. I-VILA can be
applied to both standard BERT-based models and layout-aware ones, and
consistently improves the classification accuracy.

7.2 I-VILA Improves Accuracy without Pretraining

In Table 5, we fine-tune a series of I-VILA models based on BERT, and
compare their performance with LayoutLM and LayoutLMv2 (Xu et al., 2021),
which require additional large-scale pretraining on corpora with layout.
BERT+I-VILA achieves comparable accuracy to LayoutLM (0.00%, −0.89%,
−1.05%), with only 5% of the training cost.11 I-VILA also closes the gap
with the latest multimodal method LayoutLMv2 (Xu et al., 2021) with only
1% of the training cost. This further verifies that injecting layout
indicator tokens is a novel and effective way of incorporating layout
information into language models.

10Positional embeddings are not used in these models.
11It takes 10.5 hours to finish fine-tuning I-VILA on the
GROTOAP2 dataset using a 4 RTX 8000 machine, equivalent
to around 60 V100 GPU hours, approximately 5% of the 1280
hours of the pretraining time for LayoutLM.

8 VILA in Practice: The Impact of Layout Group Detectors

Applying VILA methods in practice requires running a group layout
detector as a critical first step. In this section, we analyze how the
accuracy of different block and line group detectors affects the accuracy
of H-VILA and I-VILA models.

The results are shown in Table 6. We report on the S2-VL dataset using
two automated group detectors: the CERMINE PDF parser (Tkaczyk et al.,
2015) and the Mask R-CNN vision model trained on the PubLayNet dataset
(Zhong et al., 2019). We also report on using ground truth blocks as an
upper bound. The ‘‘Group-uniform Oracle’’ illustrates how well the
different group detectors reflect the group uniformity assumption; in the
oracle setting, one is given ground truth labels but is restricted to
assigning the same label to all tokens in a group.

When using text blocks, the performance of H-VILA hinges on the accuracy
of group detection, while I-VILA shows more reliable results when using
different group detectors. This suggests that improvements in vision
models for block detection could be a promising avenue for improving
content extraction performance, especially when using H-VILA, and I-VILA
may be the better choice when block detection accuracy is lower.
We also observe that text line-based methods tend to be higher performing
for both group detectors, by a small margin for I-VILA and a larger one
for H-VILA. The group detectors in our experiments are trained on data
from PubLayNet, and applied to a different dataset, S2-VL. This

Model                               GROTOAP2           DocBank            S2-VL                         Training Cost¹
                                    F1 ↑    H(G) ↓     F1 ↑    H(G) ↓     F1 ↑            H(G) ↓
BERTBASE (Devlin et al., 2019)      90.78   1.58       87.24   3.50       78.34(6.53)¹    7.17(0.95)    40 hr fine-tuning
BERTBASE + I-VILA(Text Line)        91.65   1.13       90.25   2.56       81.15(4.83)     4.76(1.28)    40 hr fine-tuning
BERTBASE + I-VILA(Text Block)       92.31   0.63       89.49   2.25       81.82(4.88)     3.65(0.26)    40 hr fine-tuning
LayoutLMBASE (Xu et al., 2020)      92.34   0.78       91.06   2.64       82.69(6.04)     4.19(0.25)    1.2k hr pretraining + 50 hr fine-tuning
LayoutLMv2BASE (Xu et al., 2021)    –²      –          93.33   1.93       83.05(4.51)     3.34(0.82)    9.6k hr pretraining³ + 130 hr fine-tuning

¹ We report the equivalent V100 GPU hours on the GROTOAP2 dataset in this column.
² LayoutLMv2 cannot be trained on the GROTOAP2 dataset because almost 30% of its instances do not have compatible
PDF images.
³ The authors do not report the exact cost in the paper. The number is a rough estimate based on our experimental results.

Table 5: Comparison between I-VILA models and other layout-aware methods that require expensive pretraining.
I-VILA achieves comparable accuracy with less than 5% of the training cost.

Experiment      Group Source    Group-uniform Oracle            I-VILA                        H-VILA
                                Max Macro F1    H(G)            Macro F1       H(G)           Macro F1       H(G)
Varying G(B)    Ground-Truth    100.00(0.00)    0.00(0.00)      86.50(4.52)    1.86(0.29)     85.91(3.13)    0.35(0.19)
                Vision Model    99.31(0.23)     1.09(0.30)      83.44(6.48)    2.83(0.34)     82.09(5.89)    0.36(0.12)
                PDF Parsing     96.91(1.09)     2.06(0.86)      83.95(4.45)    3.93(0.93)     78.69(4.90)    0.02(0.01)
Varying G(L)    Vision Model    99.57(0.13)     0.42(0.18)¹     83.77(5.75)    1.20(0.16)     83.69(2.92)    0.20(0.12)
                PDF Parsing     99.70(0.12)     0.38(0.26)      82.97(5.56)    1.28(0.13)     82.61(4.10)    0.00(0.00)

¹ For text line detector experiments, we report H(G) based on text lines rather than blocks.

Table 6: VILA model performance when using different layout group detectors for text blocks G(B) and lines G(L)
on the S2-VL dataset.

domain transfer affects block detectors more than
line detectors, because the two datasets define
blocks differently. This setting is realistic because
ground truth blocks from the target dataset may
not always be available for training (even when
labeled tokens are). Training a group detector on
S2-VL is likely to improve performance.
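
The ‘‘Group-uniform Oracle’’ setting described in this section can be illustrated with a short sketch. It assumes the oracle picks the most frequent ground-truth label in each detected group, which maximizes token accuracy per group; whether the reported Max Macro F1 was obtained with exactly this rule is not stated, so the function names and the majority rule are assumptions for this example.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def group_uniform_oracle(gold_groups: List[Sequence[Hashable]]) -> List[List[Hashable]]:
    """Assign each group's most frequent ground-truth label to all its tokens."""
    predictions: List[List[Hashable]] = []
    for gold in gold_groups:
        if not gold:
            predictions.append([])
            continue
        majority, _ = Counter(gold).most_common(1)[0]
        predictions.append([majority] * len(gold))
    return predictions


def token_accuracy(gold_groups: List[Sequence[Hashable]],
                   pred_groups: List[Sequence[Hashable]]) -> float:
    """Fraction of tokens whose predicted label matches the ground truth."""
    pairs = [(g, p) for gold, pred in zip(gold_groups, pred_groups)
             for g, p in zip(gold, pred)]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # A detected block that wrongly merges a caption with two body tokens.
    gold = [["paragraph", "paragraph", "caption"], ["title"]]
    oracle = group_uniform_oracle(gold)
    print(oracle, token_accuracy(gold, oracle))
```

When a detector merges tokens of different categories into one group, even this oracle cannot label every token correctly, which is what the oracle columns of Table 6 quantify.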

9 Conclusion

In this paper, we introduce two new ways to integrate Visual Layout
(VILA) structures into the NLP pipeline for structured content extraction
from scientific paper PDFs. We show that inserting special indicator
tokens based on VILA (I-VILA) can lead to robust improvements in token
classification accuracy (up to +1.9% Macro F1) and consistency (up to
−32% group category inconsistency). In addition, we design a hierarchical
transformer model based on VILA (H-VILA), which can reduce inference time
by 46% with less than 0.8% Macro F1 reduction compared to previous SOTA
methods. These VILA-based methods can be easily incorporated into
different BERT variants with only fine-tuning, achieving comparable
performance against existing work with only 5% of the training cost. We
ablate the influence of different visual layout detectors on VILA-based
models, and provide suggestions for practical use. We release a benchmark
suite, along with a newly curated dataset S2-VL, to systematically
evaluate the proposed methods.

Our study is well-aligned with the recent exploration of injecting
structures into language models, and provides new perspectives on how to
incorporate documents’ visual structures. The approach shows how
explicitly modeling task structure can help achieve ‘‘green AI’’ goals,
dramatically reducing computation and energy costs without significant
loss in accuracy. While we evaluate on scientific documents, related
visual group structures also exist in other kinds of documents, and
adapting our techniques to those domains could offer improvements in
corporate reports, historical archives, or legal documents; we leave this
to future work.

Acknowledgments

We thank the anonymous reviewers and TACL editors for their comments and
feedback on our draft, and we thank Ruochen Zhang, Mark Neumann, Rodney
Kinney, Dirk Groeneveld, and Mike Cafarella for the helpful discussion
and suggestions. This project is supported in part by NSF grant
OIA-2033558, NSF RAPID award 2040196, ONR grant N00014-21-1-2707, and the
University of Washington WRF/Cable Professorship.

References

2008–2021. Grobid. https://github.com/kermitt2/grobid. Accessed:
2021-04-30.

2021. ICDAR2021 competition on mathematical formula detection.
http://transcriptorium.eu/htrcontest/MathsICDAR2021/. Accessed:
2021-04-30.

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles
Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey
Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler
Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy
Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni.
2018. Construction of the literature graph in semantic scholar. In
Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
Volume 3 (Industry Papers), pages 84–91, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18-3011

He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao,
and Ming Li. 2021. Segatron: Segment-aware transformer for language
modeling and understanding. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 35, pages 12526–12534.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained
language model for scientific text. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for
Computational Linguistics. https://doi.org/10.18653/v1/D19-1371

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.
https://doi.org/10.1023/A:1010933404324

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
BERT: Pre-training of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Computational Linguistics.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask
R-CNN. In Proceedings of the IEEE international conference on computer
vision, pages 2961–2969.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory.
Neural Computation, 9(8):1735–1780.
https://doi.org/10.1162/neco.1997.9.8.1735

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic
optimization. In 3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference
Track Proceedings.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001.
Conditional random fields: Probabilistic models for segmenting and
labeling sequence data. In Proceedings of the Eighteenth International
Conference on Machine Learning (ICML 2001), Williams College,
Williamstown, MA, USA, June 28 – July 1, 2001, pages 282–289. Morgan
Kaufmann.

Haejun Lee, Drew A. Hudson, Kangwook Lee, and Christopher D. Manning.
2020. SLM: Learning a discourse language representation with sentence
unshuffling. In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages 1551–1562, Online.
Association for Computational Linguistics.

Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and
Ming Zhou. 2020. DocBank: A benchmark dataset for document layout
analysis. In Proceedings of the 28th International Conference on
Computational Linguistics, COLING 2020, Barcelona, Spain (Online),
December 8–13, 2020, pages 949–960. International Committee on
Computational Linguistics.

Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv
Jain, Varun Manjunatha, and Hongfu Liu. 2021. SelfDoc: Self-supervised
document representation learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 5652–5660.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized BERT pretraining approach. CoRR,
cs.CL/1907.11692v1.

Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk,
Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper
Dinkla, and Peter W. J. Staar. 2021. Robust PDF document conversion using
recurrent neural networks. In Thirty-Fifth AAAI Conference on Artificial
Intelligence, AAAI 2021, Thirty-Third Conference on Innovative
Applications of Artificial Intelligence, IAAI 2021, The Eleventh
Symposium on Educational Advances in Artificial Intelligence, EAAI 2021,
Virtual Event, February 2-9, 2021, pages 15137–15145. AAAI Press.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld.
2020. S2ORC: The semantic scholar open research corpus. In Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics,
pages 4969–4983, Online. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay
regularization. In 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.
OpenReview.net.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos,
Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii
Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training.
In 6th International Conference on Learning Representations, ICLR 2018,
Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track
Proceedings. OpenReview.net.

Mark Neumann, Zejiang Shen, and Sam Skjonsberg. 2021. PAWLS: PDF
annotation with labels and structure. In Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing: System
Demonstrations, pages 258–264, Online. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2021.acl-demo.31

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury,
Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito,
Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu
Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative
style, high-performance deep learning library. In Advances in Neural
Information Processing Systems, volume 32. Curran Associates, Inc.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster
R-CNN: Towards real-time object detection with region proposal networks.
Advances in Neural Information Processing Systems, 28:91–99.

Nipun Sadvilkar and Mark Neumann. 2020. PySBD: Pragmatic sentence
boundary disambiguation. In Proceedings of Second Workshop for NLP Open
Source Software (NLP-OSS), pages 110–114, Online. Association for
Computational Linguistics. https://doi.org/10.18653/v1/2020.nlposs-1.15

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019.
DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and
lighter. CoRR, cs.CL/1910.01108v4.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green
AI. Communications of the ACM, 63(12):54–63.
https://doi.org/10.1145/3381831

Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee,
Jacob Carlson, and Weining Li. 2021. LayoutParser: A unified toolkit for
deep learning based document image analysis. In Document Analysis and
Recognition – ICDAR 2021, pages 131–146, Cham. Springer International
Publishing. https://doi.org/10.1007/978-3-030-86549-8_9

Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018.
Extracting scientific figures with distantly supervised neural networks.
In Proceedings of the 18th ACM/IEEE on joint conference on digital
libraries, pages 223–232. https://doi.org/10.1145/3197026.3197040

Peter W. J. Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. 2018.
Corpus conversion service: A machine learning platform to ingest
documents at scale. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pages 774–782.
https://doi.org/10.1145/3219819.3219834

Dominika Tkaczyk, Pawel Szostek, and Lukasz Bolikowski. 2014. GROTOAP2 -
the methodology of creating a large ground truth dataset of scientific
articles. D-Lib Magazine, 20(11/12).
https://doi.org/10.1045/november14-tkaczyk

Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek,
and Łukasz Bolikowski. 2015. CERMINE: Automatic extraction of structured
metadata from scientific literature. International Journal on Document
Analysis and Recognition (IJDAR), 18(4):317–335.
https://doi.org/10.1007/s10032-015-0249-8

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. In Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.

Lucy Lu Wang, Isabel Cachola, Jonathan Bragg, Evie Yu-Yen Cheng, Chelsea
Haupt, Matt Latzke, Bailey Kuehl, Madeleine van Zuylen, Linda Wagner, and
Daniel S. Weld. 2021. Improving the accessibility of scientific
documents: Current state, user needs, and a system solution to enhance
scientific PDF accessibility for blind and low vision users. CoRR,
cs.DL/2105.00076v1.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang
Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William
Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan,
Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm,
Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian
Kohlmeier. 2020. CORD-19: The covid-19 open research dataset. CoRR,
cs.DL/2004.10706v4.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement
Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan
Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma,
Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger,
Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers:
State-of-the-art natural language processing. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, pages 38–45, Online. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou.
2020. LayoutLM: Pre-training of text and layout for document image
understanding. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, Virtual Event, CA, USA, August 23–27, 2020,
pages 1192–1200. ACM.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan
Lu, Dinei A. F. Florêncio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong
Zhou. 2021. LayoutLMv2: Multi-modal pre-training for visually-rich
document understanding. In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume
1: Long Papers), Virtual Event, August 1–6, 2021, pages 2579–2591.
Association for Computational Linguistics.

Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork.
2020. Beyond 512 tokens: Siamese multi-depth transformer-based
hierarchical encoder for long-form document matching. In Proceedings of
the 29th ACM International Conference on Information & Knowledge
Management, pages 1725–1734. https://doi.org/10.1145/3340531.3411908

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level
pre-training of hierarchical bidirectional transformers for document
summarization. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 5059–5069, Florence,
Italy. Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1499

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. PubLayNet:
Largest dataset ever for document layout analysis. In 2019 International
Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022.
IEEE. https://doi.org/10.1109/ICDAR.2019.00166

A Model Performance Breakdown

In Tables 7, 8, and 9, we present model accuracies
on GROTOAP2, DocBank, and S2-VL for each
category for the results reported in the main paper.

B Improvements of the DocBank Dataset

We implement several fixes for the public version
of the DocBank dataset to improve its accuracy
and create faithful VILA structures.

B.1 Dataset Artifacts

As the DocBank dataset is automatically generated
via parsing LaTeX source from arXiv, it will
inevitably include noise. Moreover, the authors
only release the document screenshots and token
information parsed using PDFMiner12 instead of
the source PDF files, which causes additional
issues when using the dataset. We identify some
major error categories during the course of our
project, detailed as follows:

Incorrect PDF Parsing The PDFMiner software does not work perfectly when
parsing CID fonts,13 which are often used for rendering special symbols
in PDFs. For example, the software may incorrectly parse 25°C as
25(cid:176)C. Including such (cid:*) tokens in the input text is not
reasonable, because they break the natural flow of the text and most
pre-trained language model tokenizers cannot appropriately encode such
tokens.

Erroneous Label Generation Token labels in DocBank are extracted by
parsing LaTeX commands. For example, it will label all text in the
command \abstract{*} as ‘‘abstract’’. Though theoretically this approach
may work well for ‘‘standard’’ documents, we find the resulting label
quality is far from ideal when processing real-world documents at scale.
One major issue is that it cannot appropriately handle user-created
macros, which are often used for compiling complex math equations. This
leads to very low (label) accuracy in the ‘‘equation’’ category in the
dataset; in fact, we manually inspected 10 pages, and found that 60% of
the math equation tokens are wrongly labeled as other classes. This
approach also fails to appropriately label some document texts that are
passively generated by LaTeX commands; for example, the ‘‘Figure *’’
produced by the \caption command is treated as ‘‘paragraph’’.

Lack of VILA Structures As the DocBank dataset generating method solely
operates on the document TeX sources, it does not include visual layout
information. The missing VILA structures lead to low label accuracy for
layout-sensitive categories like figures and tables. For example, when a
figure contains selectable text (i.e., it is not stored in a format like
PNG or JPG, but instead contains text tokens returned by the PDF parser),
the method cannot recognize such tokens and thus assigns incorrect labels
(other than ‘‘figure’’). Though the authors tried to create layout group
structures by applying a connected component analysis method to PDF
tokens,14 we observed different types of errors in the generated groups,
for example, mis-identifying paragraph breaks (combining multiple
paragraph blocks into

12https://github.com/euske/pdfminer (last accessed Jan. 1, 2022).

13https://en.wikipedia.org/wiki/PostScript_fonts (last accessed Jan. 1, 2022).

14The algorithm iteratively selects and groups adjacent
tokens with the same category, and ultimately produces a list
of token collections that approximate the layout groups.


Model                              Abstract  Acknowledgment  Affiliation  Author  Author Title  Bib Info  Body Content  Conflict Statement
BERTBASE                           97.42     95.83           96.12        96.91   96.09         95.00     98.80         88.66
BERTBASE + I-VILA(Text Line)       97.65     95.89           96.61        97.17   96.48         95.78     98.93         88.28
BERTBASE + I-VILA(Text Block)      97.67     96.46           96.80        97.23   97.73         96.29     98.99         91.88
LayoutLMBASE                       98.05     96.29           96.64        97.49   96.51         96.74     99.06         91.16
LayoutLMBASE + Sentence Breaks     97.92     96.32           96.68        96.74   95.42         96.77     99.11         90.42
LayoutLMBASE + I-VILA(Text Line)   97.99     96.41           96.72        97.29   95.98         96.66     99.11         90.75
LayoutLMBASE + I-VILA(Text Block)  98.12     96.81           96.93        96.96   97.52         96.87     99.14         91.43
Simple Group Classifier            96.10     95.53           97.10        97.48   97.94         96.68     98.94         93.25
H-VILA(Text Line)                  98.47     95.88           96.21        97.46   95.26         96.68     99.16         89.67
H-VILA(Text Block)                 98.01     96.45           96.14        97.38   96.31         96.33     99.08         91.67
# Tokens in Class                  395788    88531           90775        26742   7083          223739    7567934       22289

contd.

Model                              Copyright  Correspondence  Dates  Editor  Equation  Figure  Glossary  Keywords
BERTBASE                           97.34      89.66           94.56  99.71   17.60     94.05   80.18     93.42
BERTBASE + I-VILA(Text Line)       97.38      89.57           94.60  99.93   25.00     94.84   81.35     94.34
BERTBASE + I-VILA(Text Block)      97.85      91.29           94.99  99.95   29.46     95.52   80.45     95.40
LayoutLMBASE                       97.63      89.99           94.80  99.90   30.78     95.52   83.83     94.95
LayoutLMBASE + Sentence Breaks     97.62      90.07           94.73  99.95   20.73     95.83   84.99     93.88
LayoutLMBASE + I-VILA(Text Line)   97.47      90.97           95.20  99.93   26.42     95.67   84.16     94.82
LayoutLMBASE + I-VILA(Text Block)  97.66      91.04           95.13  100.00  39.28     95.74   87.00     96.23
Simple Group Classifier            97.56      92.11           95.47  100.00  33.17     95.77   80.35     95.64
H-VILA(Text Line)                  97.78      89.96           94.98  99.91   15.60     95.63   84.01     93.69
H-VILA(Text Block)                 97.98      90.37           94.92  100.00  30.64     95.86   78.29     96.15
# Tokens in Class                  57419      26653           23702  2937    761       581554  2807      7012

contd.

Model                              Page Number  References  Table    Title  Type   Unknown  Macro F1
BERTBASE                           98.32        99.60       94.11    97.60  87.62  88.60    90.78
BERTBASE + I-VILA(Text Line)       98.82        99.60       94.53    97.77  93.70  88.14    91.65
BERTBASE + I-VILA(Text Block)      98.92        99.64       94.31    98.19  93.09  88.81    92.31
LayoutLMBASE                       98.94        99.62       95.30    97.91  91.24  89.19    92.34
LayoutLMBASE + Sentence Breaks     98.90        99.61       95.63    98.13  91.68  89.14    91.83
LayoutLMBASE + I-VILA(Text Line)   99.05        99.63       95.61    97.80  94.59  89.86    92.37
LayoutLMBASE + I-VILA(Text Block)  99.05        99.65       95.73    98.39  95.17  90.47    93.38
Simple Group Classifier            99.02        99.61       93.94    98.18  94.91  89.60    92.65
H-VILA(Text Line)                  98.96        99.63       96.02    97.76  93.61  90.00    91.65
H-VILA(Text Block)                 99.16        99.68       95.00    98.36  95.07  89.23    92.37
# Tokens in Class                  46884        2340796     558103   22110  4543   54639


Table 7: Prediction F1 breakdown for all models on the GROTOAP2 dataset.

Model                              Abstract  Author  Caption  Date   Figure  Footer  List
BERTBASE                           97.82     89.96   93.91    87.33  71.97   84.76   75.99
BERTBASE + I-VILA(Text Line)       97.99     90.67   95.74    88.12  88.85   88.29   80.20
BERTBASE + I-VILA(Text Block)      98.15     90.66   96.56    87.83  79.49   88.40   80.72
LayoutLMBASE                       98.63     92.25   96.88    87.13  76.56   94.26   89.67
LayoutLMBASE + Sentence Breaks     98.48     92.70   96.93    88.06  77.65   94.35   90.46
LayoutLMBASE + I-VILA(Text Line)   98.57     92.64   97.35    87.87  90.78   94.37   90.77
LayoutLMBASE + I-VILA(Text Block)  98.68     92.31   97.44    87.69  83.41   94.03   90.56
LayoutLMv2BASE                     98.68     93.04   97.49    89.55  85.60   95.30   93.63
Simple Group Classifier            93.85     84.68   96.55    71.04  80.63   91.58   83.84
H-VILA(Text Line)                  98.68     90.95   95.46    80.99  88.79   93.84   90.77
H-VILA(Text Block)                 98.57     86.81   95.76    70.33  80.29   91.23   79.82
# Tokens in Class                  461898    81061   858862   3275   932150  158176  684786

contd.

Model                              Paragraph  Reference  Section  Table   Title  Macro F1
BERTBASE                           96.84      92.05      92.81    74.19   89.31  87.24
BERTBASE + I-VILA(Text Line)       97.85      92.68      94.91    77.39   90.34  90.25
BERTBASE + I-VILA(Text Block)      97.51      92.62      94.86    76.91   90.22  89.49
LayoutLMBASE                       97.72      93.16      96.31    77.38   92.80  91.06
LayoutLMBASE + Sentence Breaks     97.81      92.61      96.58    78.84   92.81  91.44
LayoutLMBASE + I-VILA(Text Line)   98.44      92.87      96.60    80.43   92.78  92.79
LayoutLMBASE + I-VILA(Text Block)  98.13      93.27      96.44    79.51   92.48  92.00
LayoutLMv2BASE                     98.46      94.30      96.48    84.41   93.10  93.34
Simple Group Classifier            97.53      92.54      85.33    73.85   92.65  87.01
H-VILA(Text Line)                  98.36      93.81      95.27    78.46   89.81  91.27
H-VILA(Text Block)                 97.53      92.97      86.70    79.84   93.52  87.78
# Tokens in Class                  20630188   1813594    154062   235801  26355

Table 8: Prediction F1 breakdown for all models on the DocBank dataset.

one) or overlapping layout groups (caused by
incorrect token labels), and chose not to use them.
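The connected component grouping described in footnote 14 can be sketched roughly as follows. This is only an illustration under our own assumptions, not the DocBank authors' implementation: the `Token` class, the `max_gap` threshold, and the rule that two tokens are "adjacent" when their boxes lie within `max_gap` on both axes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    box: tuple  # (x1, y1, x2, y2) in page coordinates
    category: str


def is_adjacent(a, b, max_gap=5.0):
    """True if the two boxes are within max_gap of each other on both axes."""
    ax1, ay1, ax2, ay2 = a.box
    bx1, by1, bx2, by2 = b.box
    gap_x = max(bx1 - ax2, ax1 - bx2, 0.0)
    gap_y = max(by1 - ay2, ay1 - by2, 0.0)
    return gap_x <= max_gap and gap_y <= max_gap


def group_tokens(tokens, max_gap=5.0):
    """Group tokens into connected components of adjacent same-category tokens."""
    parent = list(range(len(tokens)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            same = tokens[i].category == tokens[j].category
            if same and is_adjacent(tokens[i], tokens[j], max_gap):
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, tok in enumerate(tokens):
        groups.setdefault(find(i), []).append(tok)
    return list(groups.values())


# Toy page: a two-word title line above a two-word paragraph line.
tokens = [
    Token("Deep", (10, 10, 40, 20), "title"),
    Token("Learning", (43, 10, 90, 20), "title"),
    Token("We", (10, 24, 25, 34), "paragraph"),
    Token("show", (28, 24, 55, 34), "paragraph"),
]
for group in group_tokens(tokens):
    print(group[0].category, [t.text for t in group])
```

Because grouping depends only on category and proximity, two consecutive paragraphs that sit closer than `max_gap` collapse into one group, and a wrongly labeled token can split or bridge groups. These are the two failure modes reported above.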

B.2 Fixes and Enhancement

Based on the aforementioned issues, we implement
the following fixes and enhance the DocBank
dataset with VILA structures.

Remove Incorrect PDF Tokens Given that there
is no simple way to recover the incorrect
(cid:*) tokens generated by PDFMiner, we
simply remove them from the input text.
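A minimal sketch of this cleanup step is shown here. It assumes the tokens are available as plain strings and that the placeholders always have the form (cid:N) with a decimal N. The function name `remove_cid_tokens` is ours, and the choice to strip the placeholder from inside a token (rather than drop the whole token) is also an assumption.

```python
import re

# Placeholder emitted for glyphs that could not be mapped to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def remove_cid_tokens(tokens):
    """Strip (cid:N) placeholders and drop tokens that become empty."""
    cleaned = (CID_PATTERN.sub("", tok) for tok in tokens)
    return [tok for tok in cleaned if tok.strip()]


print(remove_cid_tokens(["25(cid:176)", "C", "(cid:13)", "water"]))
```

In a real pipeline each token also carries a bounding box and a label, so any token dropped here would have to be dropped from those parallel lists as well.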

Generate VILA Structures We use pre-trained
Faster-RCNN models (Ren et al., 2015) from the
LayoutParser (Shen et al., 2021) tool to identify
both the text lines and blocks based on the
page images. Specifically, for text blocks, we use
the PubLayNet/mask_rcnn_R_50_FPN_3x/
model to detect the body content regions (including
title, paragraph, figure, table, and list) and the
MFD/faster_rcnn_R_50_FPN_3x/ model to
detect the display math equation regions. We also
fine-tune a Fast RCNN model on the GROTOAP2
dataset (which has text line annotation), and use


Model                              Abstract      Author        Bibliography  Caption       Equation      Figure        Footer        Footnote
BERTBASE                           91.67(5.51)   71.38(18.79)  97.90(1.59)   94.64(1.38)   76.23(4.36)   60.14(24.13)  61.99(17.04)  62.91(7.23)
BERTBASE + I-VILA(Text Line)       89.38(6.50)   65.93(15.48)  97.92(1.56)   96.66(1.39)   83.22(5.87)   72.11(13.35)  57.75(22.46)  72.78(12.45)
BERTBASE + I-VILA(Text Block)      90.45(3.61)   64.97(16.11)  97.21(1.27)   96.82(0.94)   83.56(5.59)   70.57(11.56)  59.79(23.18)  80.17(10.48)
LayoutLMBASE                       91.87(4.89)   69.39(11.30)  98.08(1.13)   92.98(7.35)   77.49(7.13)   74.46(18.48)  67.42(18.90)  77.22(17.59)
LayoutLMBASE + Sentence Breaks     92.01(4.79)   69.22(11.02)  98.57(1.24)   95.74(1.36)   77.94(9.68)   67.80(25.61)  69.67(20.06)  78.57(16.45)
LayoutLMBASE + I-VILA(Text Line)   91.77(5.85)   69.81(7.86)   98.09(1.64)   94.06(2.91)   84.48(7.00)   71.57(21.49)  67.23(23.01)  77.10(15.64)
LayoutLMBASE + I-VILA(Text Block)  92.91(4.02)   70.42(13.38)  98.19(1.57)   97.19(1.16)   83.76(6.61)   68.38(26.11)  68.03(19.11)  76.77(17.64)
LayoutLMv2BASE                     91.09(6.46)   63.42(17.55)  97.74(2.00)   96.73(1.39)   77.18(13.70)  83.71(11.53)  64.37(22.24)  70.20(12.43)
H-VILA(Text Line)                  93.90(5.16)   70.86(9.78)   97.71(1.26)   92.86(3.89)   81.38(7.79)   77.86(10.65)  65.95(23.44)  81.76(15.03)
H-VILA(Text Block)                 93.40(6.14)   67.03(19.43)  96.11(3.38)   92.76(6.47)   86.87(8.64)   79.64(11.21)  63.72(22.01)  83.66(9.88)
# Tokens in Class                  2854(432)     543(118)      15681(3704)   4046(2119)    2552(1872)    1402(1316)    480(205)      2468(1254)

contd.

Model                              Header        Keywords      List          Paragraph     Section       Table         Title         Macro F1
BERTBASE                           76.47(8.51)   90.16(6.44)   51.00(16.90)  96.07(1.37)   79.72(3.46)   79.93(16.26)  84.81(8.52)   78.34(6.53)
BERTBASE + I-VILA(Text Line)       81.53(7.94)   87.06(5.57)   58.64(8.10)   96.67(1.13)   87.21(3.25)   85.58(15.67)  84.80(5.84)   81.15(4.83)
BERTBASE + I-VILA(Text Block)      83.99(8.74)   87.86(7.51)   62.01(13.25)  96.65(1.21)   86.71(3.23)   80.44(16.35)  86.14(5.23)   81.82(4.88)
LayoutLMBASE                       88.21(5.81)   88.14(5.94)   58.21(15.15)  96.88(0.87)   88.14(2.73)   82.02(15.58)  89.90(8.17)   82.69(6.04)
LayoutLMBASE + Sentence Breaks     88.08(5.71)   88.80(3.23)   60.61(11.80)  97.01(0.85)   88.05(2.79)   81.59(16.22)  88.52(5.92)   82.81(5.21)
LayoutLMBASE + I-VILA(Text Line)   87.14(6.49)   86.66(6.24)   65.82(10.92)  97.17(1.26)   89.79(2.48)   86.00(12.33)  89.89(7.47)   83.77(5.75)
LayoutLMBASE + I-VILA(Text Block)  88.39(6.20)   90.92(3.97)   59.06(17.99)  97.17(1.14)   88.67(3.57)   81.84(15.77)  89.95(6.32)   83.44(6.48)
LayoutLMv2BASE                     86.95(6.84)   89.71(7.95)   68.36(10.05)  96.65(0.71)   89.48(4.13)   81.69(15.05)  88.46(6.00)   83.05(4.51)
H-VILA(Text Line)                  87.89(6.45)   86.34(5.02)   65.76(10.26)  96.90(0.75)   85.45(2.02)   85.19(7.55)   85.62(6.00)   83.69(2.92)
H-VILA(Text Block)                 86.49(6.08)   76.97(18.82)  55.82(16.99)  96.43(1.40)   86.72(4.55)   81.38(14.94)  84.39(9.10)   82.09(5.89)
# Tokens in Class                  1122(463)     130(27)       2274(593)     95732(8226)   882(113)      3887(2041)    240(26)

Table 9: Prediction F1 breakdown for all models on the S2-VL dataset. Similar to the results in the
main paper, we show averaged scores with standard deviation in parentheses across the 5-fold cross
validation subsets.

it to detect the text lines. All other regions (or
texts that are not covered by the detected blocks
or lines) are created by the connected component
analysis method.
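Once block or line boxes have been detected, every token has to be attached to a region, and the leftover tokens go to the fallback grouping. The matching rule is not spelled out here, so the sketch below assumes a simple one: a token belongs to the first region that contains the center of its bounding box. The function names and the example boxes are hypothetical, and the detection step itself (running the LayoutParser models) is omitted.

```python
def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def contains(region, point):
    """True if the point lies inside the (x1, y1, x2, y2) region."""
    x1, y1, x2, y2 = region
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def assign_tokens(token_boxes, region_boxes):
    """Map each token to the first region containing its center.

    Returns (assignment, uncovered): assignment[i] is a region index or
    None, and uncovered lists the token indices left for a fallback
    grouping step.
    """
    assignment = []
    for box in token_boxes:
        point = center(box)
        match = next(
            (r for r, region in enumerate(region_boxes) if contains(region, point)),
            None,
        )
        assignment.append(match)
    uncovered = [i for i, r in enumerate(assignment) if r is None]
    return assignment, uncovered


# Hypothetical boxes: two detected blocks and three tokens.
regions = [(0, 0, 100, 50), (0, 60, 100, 120)]
token_boxes = [(5, 5, 30, 15), (10, 70, 40, 80), (150, 5, 170, 15)]
assignment, uncovered = assign_tokens(token_boxes, regions)
print(assignment, uncovered)
```

The third token falls outside both regions, so it is reported as uncovered and would be handled by the connected component analysis instead.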

Correct Label Errors Given the VILA structures,
we can easily correct some previously mentioned
errors like incorrect labels for “Figure *” by
applying majority voting for token labels in a text
block. However, for the “equation” category,
given the low accuracy of the original DocBank
labels, neither majority voting nor other automatic
methods can easily recover the correct token
categories. Therefore, we choose to discard this
category in the modeling phase, namely, converting
all existing “equation” labels to the background
category “paragraph”.
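The two label fixes can be sketched as follows, assuming that token labels are stored in a list and that each text block is given as a list of token indices. Both function names are ours. Ties in the vote go to the label seen first, which is an arbitrary choice made only for this illustration.

```python
from collections import Counter


def majority_vote(token_labels, token_groups):
    """Relabel every token with the most common label in its text block."""
    corrected = list(token_labels)
    for indices in token_groups:
        votes = Counter(token_labels[i] for i in indices)
        winner, _ = votes.most_common(1)[0]
        for i in indices:
            corrected[i] = winner
    return corrected


def drop_equation(labels, background="paragraph"):
    """Convert every "equation" label to the background category."""
    return [background if label == "equation" else label for label in labels]


# A caption block whose leading "Figure" token was mislabeled as paragraph,
# followed by a single-token block labeled as equation.
labels = ["paragraph", "caption", "caption", "caption", "equation"]
groups = [[0, 1, 2, 3], [4]]
print(drop_equation(majority_vote(labels, groups)))
```

The vote repairs the mislabeled first token because three of the four tokens in its block agree on "caption". The isolated "equation" token is not recoverable this way and is simply mapped to "paragraph".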

We update our methods for several rounds to
coordinate the fixes and enhancements, and
ultimately we can reduce more than 90% of the label
errors for figure and table captions. By using the
accurate pre-trained layout detection models, the
generated VILA structures are more than 95%
accurate.15


15We randomly sample 30 pages from both the training
and test dataset, and annotate the number of incorrect
text blocks for each page. A text block is considered
incorrect when it wrongly merges multiple regions (e.g., two
paragraphs or one paragraph and the adjacent section header)
or splits regions (e.g., generating multiple blocks for one
paragraph). We report the average of page block accuracy.
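The metric in footnote 15 could be computed along these lines. The footnote does not give a formula, so the sketch assumes that page block accuracy is the fraction of text blocks on a page that were not marked incorrect, and that the reported figure is the plain average over the sampled pages. The page counts used here are made up for illustration.

```python
def page_block_accuracy(num_blocks, num_incorrect):
    """Fraction of text blocks on one page that are correct (assumed definition)."""
    return (num_blocks - num_incorrect) / num_blocks


def average_block_accuracy(pages):
    """Average page block accuracy over (num_blocks, num_incorrect) pairs."""
    return sum(page_block_accuracy(n, bad) for n, bad in pages) / len(pages)


# Made-up annotation counts for three sampled pages.
pages = [(20, 1), (10, 0), (8, 1)]
print(f"{average_block_accuracy(pages):.3f}")
```

Averaging per page rather than pooling all blocks gives every sampled page the same weight, so a page with few blocks influences the result as much as a dense one.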
