Data Intelligence Just Accepted MS.
https://doi.org/10.1162/dint_a_00216
Predicting an Optimal Virtual Data Model for Uniform
Access to Large Heterogeneous Data
CHAHRAZED B.BACHIR BELMEHDI, ABDERRAHMANE KHIAT AND NABIL KESKES
LabRI-SBA, Enterprise Information Systems
ESI-SBA Institute; Fraunhofer IAIS
Algeria; Germany
Email: {cb.bachirbelmehdi;n.keskes}@esi-sba.dz; abderrahmane.khiat@iais.fraunhofer.de
Abstract

The growth of generated data in industry requires new efficient big data integration approaches for uniform data access by end-users to perform better business operations. Data virtualization systems, including Ontology-Based Data Access (OBDA), query data on-the-fly against the original data sources without any prior materialization. Existing approaches use by design a fixed virtual data model, e.g., TABULAR, as the only common schema built on-the-fly to load, transform, and join the relevant data, while other data models, such as GRAPH or DOCUMENT, are more suitable for some query types, e.g., queries with many join or nested operations. Because it depends on many criteria, e.g., query plan, data size, and operations, selecting an optimal virtual data model for queries over large heterogeneous datasets is hard. To address this problem, we present a new approach, OPTIMA, which (1) builds on the principle of OBDA and leverages state-of-the-art big data technologies, Apache-Spark and Graphx, to implement two virtual data models, GRAPH and TABULAR, currently supporting out-of-the-box five data source models — property graph, document-based, wide-columnar, relational, and tabular — stored in Neo4j, MongoDB, Cassandra, MySQL, and CSV, respectively; and (2) calls a deep learning method to predict the optimal virtual data model, GRAPH or TABULAR, using features extracted from the SPARQL query. Extensive experiments show that our approach is efficient, returning the optimal virtual model with an accuracy of 0.831 and therefore reducing query execution time by over 40% for the TABULAR model selection and over 30% for the GRAPH model selection.

Keywords: Data Virtualization, Big Data, OBDA, Deep Learning.
1. INTRODUCTION

Massive volumes of data are generated every year [1]. However, this drastically increasing volume of data is less used, consumed, and exploited than it could be, raising a gap between the data generated and the effective insights that modern applications aim to get from it, e.g., information about transactions or knowledge extracted by machines [2]. Consequently, there is a need for data integration approaches that keep up with the growing volume of data [3, 4]. The challenge is that data is resided in various platforms (cloud, machines, mainframes), stored in different sources (Oracle, MongoDB, etc.), and represented in different models and formats (relational, graph, no-relational [5]). "Data virtualization [6]" is a method for integrating large heterogeneous data in a uniform manner.
© 2023 Chinese Academy of Sciences. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Data virtualization approaches tackle this challenge by creating a virtual data model under which the heterogeneous formats are homogenized on-the-fly without data materialization [7], thus reducing cost and simplifying data management, updates, and maintenance. Ontology-based data access (OBDA) [8] also implements a virtual data model and addresses data integration challenges with practical knowledge representation models, ontology-based mappings, and a unique query language, SPARQL1 [9].
Existing approaches [10, 11, 12] use by design only one virtual data model2 (e.g., TABULAR) to load and transform the requested data into a uniform model to be joined and aggregated, while other data models, such as GRAPH or DOCUMENT, are more suitable [13]. For example, approaches using a fixed TABULAR virtual model (TABULAR is a model that uses predefined structures, i.e., table definitions) can suffer degraded performance for SPARQL queries that involve many join operations on very large data. In contrast, other data models such as GRAPH (a model that structures data into a set of nodes, relationships, and properties and, most importantly, stores relationships at the individual record level) perform better for such queries. On the other hand, the TABULAR model performs better for queries that involve selection or projection. The problem to be addressed in this paper is defined as follows: given a query, "which virtual data model is optimal, i.e., which model has the lowest query execution time (cost), and how can it be selected?".
It is very challenging, however, to automatically select the optimal virtual model based on queries, since it is not realistic to compute the query execution time for all SPARQL queries against all virtual data models to get the actual cost. Moreover, query behavior on data virtualization is quite hard to predict, since the behavior depends not only on the virtual data model but also on query planning. To the best of our knowledge, existing machine learning techniques [14, 15, 16, 17] were established in the literature for cost estimation of SPARQL queries; most of them, however, are designed for querying uniform data, e.g., RDF3, and not for distributed data sources.
To address these research questions, we developed OPTIMA — an extensible OBDA framework that predicts the optimal virtual data model, GRAPH or TABULAR, using a deep learning algorithm to join data from source databases that support the Property Graph, Relational, Tabular, Document-based, and Wide-Columnar models. The proposed algorithm uses one-hot vector encoding to transform different SPARQL features into hidden representations. Next, it embeds these representations into a tree-structured model, which is used to classify the virtual model, GRAPH or TABULAR, that has the lowest query execution time.
Extensive experiments show that our approach runs successfully, returning the optimal virtual model with an accuracy of 0.831, thus reducing the query execution time by over 40% for the TABULAR model selection and over 30% for the GRAPH model selection.
The article is structured as follows. The underlying concepts of ontology-based big data access are given in Section 2. Our approach is described in detail in Section 3. A further description of the deep learning model is presented in Section 4. Experimental results are reported and explained in Section 5. Related work is presented in Section 6. Section 7
1SPARQL is a query language for Resource Description Framework (RDF).
2We denote GRAPH and TABULAR when referring to the type of virtual data model; while we denote Property
Graph, Document-based, Wide-Columnar, Relational, and Tabular when addressing the source model.
3Resource Description Framework (RDF) is a standard designed as a data model for describing metadata.
concludes with an outlook on possible future work.
2. Preliminaries
Our proposed approach requires the following inputs: (1) data sources using different models, (2) semantic mappings that describe the mapping in the RDF Mapping Language, (3) information about the data sources (password, etc.), and (4) a set of SPARQL queries. To guide the subsequent description of our approach, we provide the following definitions:
Definition 1 (Data Source Schema) A dataset schema is the set Sd ∪ Sc ∪ Sr ∪ Sg ∪ St considered by our approach; we introduce each model briefly as follows:
• Document-based Sd [18]: A document d is a JSON object o. An object is formed by a set of key/value pairs (aka fields) o = { k1 . . . kn }; a key is a string, while a value can be either a primitive value (e.g., a string), an array of values, an object, or null.
• Wide-Columnar Sc [19]: A table t is the unit of wide-column storage, identified by a name and composed of a set of column families f . The table's rows are identified by a unique key. Each row of the table can contain up to n records. A record is a pair of an identifier id and a value. A wide-column table is, in fact, a hash structure expressed as: t = Hashtable<key, Hashrow<f, Hashrecord<id, value>>>.
• Relational Sr [20]: A relation schema R with a set Σ of PKs, FKs, and attributes A = <A1, . . . , An>, denoted R(A1, . . . , An), is a set of n-tuples <d1, . . . , dn> where each di is an element of dom(Ai) or is null. The relation instance is the extension of the relation. A value of null represents a missing or unknown value.
• Property Graph Sg [21]: G = (V, E, λ, µ) is a directed, edge-labeled, attributed multigraph where V is a set of nodes, E ⊆ (V × V) is a set of directed edges, and λ : E → Σ is an edge labeling function assigning a label from the alphabet Σ to each edge. Properties can be assigned to edges and nodes by the function µ : (V ∪ E) × K → S, where K is a set of property keys and S the set of property values.
• Tabular St [22]: a set of tables T = { t1 . . . tn }. Each table tx integrates one or more column groups, as tx = { GC1 . . . GCn }. Each column group integrates different columns representing the atomic values to be stored in the table, GCx = { Cx1 . . . Cxn }.
We denote an entity of a data source by esx = {ai}, representing either a node, a table, or an object, where s is the schema entity, x its name, and axi are its attributes, representing either edges or columns. A data source consists of one or more entities, d = {ei}.
Definition 2 (Semantic Mapping) Semantic mappings are bridges (links) between the ontology and the source schemata elements. We differentiate between two types of semantic mappings [12]:
• Entity mapping: men = (e, c), a relation mapping an entity e from d onto an ontology class c.
• Attribute mapping: mat = (a, p), a relation mapping an attribute a from an entity e onto an ontology property p.
Definition 3 (Star-Shaped Query) A Star-Shaped Query (SSQ) is a set of triple (subject, predicate, object) patterns — BGPs4 sharing the same subject [23]. We denote an SSQ by stx = { ti = (x, pi, oi) | ti ∈ BGPq }, where x is the shared subject and BGPq = {(si, pi, oi) | pi ∈ O} is the set of triple patterns of the SSQ.
Definition 4 (Connection of SSQs) The joins of data coming from different data sources are actually represented by the connections between star-shaped queries, i.e., two SSQs sta, stb over (subject, predicate, object) triples are connected if the object of sta is the subject of stb: connected(sta, stb) → ∃ ti = (si, pi, b) ∈ sta.
Definition 5 (Relevant Entities to SSQ) [24] An entity e is relevant to an SSQ st if it contains attributes ai mapping to every triple property pi of the SSQ, i.e., relevant(e, st) → ∀pi ∈ prop(st) ∃aj ∈ e | (pi, aj) ∈ Mat, where prop is a relation returning the set of properties of a given SSQ.
Definition 6 (Entity Wrapping) It is a function wrap that takes one or more relevant entities of an SSQ and returns a Virtual Model [24]. It loads entity elements and organizes them according to the Virtual Model schema, wrap : En → PS.
Definition 7 (Virtual Data Model) The Virtual Data Model is the data structure of the computation unit of the query engine used to load, transform, and join only the relevant data. It is built and populated on-the-fly and not materialized, i.e., used only during query processing and then cleared. The Virtual Data Model has a schema that organizes data according to its structure. We consider two types of schema, GRAPH and TABULAR.
• The structure of a GRAPH [25] (in-memory) is similar to a Property Graph. A GRAPH G = (V, E) is a set of vertices V = {1 . . . n} and a set of m directed edges E. A directed edge (i, j) ∈ E connects the source vertex i ∈ V with the target vertex j ∈ V. GRAPH stores relationships at the individual record level.
• The structure of a TABULAR (in-memory) [26] is the same as that of the Tabular model defined above. TABULAR has predefined structures.
Definition 8 (Graph and Data Parallel) During query execution, the Virtual Model, GRAPH or TABULAR, is partitioned, distributed, and queried in parallel.
• GRAPH Parallel5 is executed after loading relevant entities into the DEE. Graph-parallel systems consist of a property graph G = (V, E, P) and a vertex program Q that is instantiated simultaneously on all the vertices.
• Data Parallel [27] concerns the TABULAR model and is executed after loading relevant entities into the DEE. Data-parallel computation derives parallelism by processing independent data on separate resources.
4Basic Graph Pattern (BGP) is a set of Triple Patterns; BGPs denotes a set of BGPs.
5https://gist.github.com/shagunsodhani/c72bc1928aeef40280c9
Fig. 1: Predicting Optimal Virtual Model on top of OBDA
3. Predicting Optimal Virtual Model for Querying Large
Heterogeneous Data
To solve the problem of selecting the optimal virtual data model and thus efficiently query large heterogeneous data, we propose an approach that leverages OBDA methodology and deep learning. Our solution follows OBDA and supports two types of virtual data models, GRAPH and TABULAR, to load and join data from sources with various models, i.e., property graph, document-based, wide-columnar, relational, and tabular. We use a deep learning algorithm that predicts the optimal virtual model based on query behavior. More precisely, the algorithm extracts and encodes significant features from the input SPARQL query into representations that are then embedded into a tree-structured model to classify the virtual model, GRAPH or TABULAR, that has the lowest cost, i.e., query execution time. Below we describe each part of our proposed approach, illustrated in Figure 1.
3.1 Virtual Data Model Prediction
Our distinctive deep learning model, built on top of the OBDA layers, aims to select the optimal virtual data model based on query behavior. Our algorithm analyzes and extracts features from the input SPARQL query and uses one-hot vector encoding6 to transform the different features into hidden representations. Next, these representations are embedded into a tree-structured model, which can effectively learn the representations of query plan features and predict the cost against each virtual data model. As an output, the proposed algorithm returns the optimal virtual model, GRAPH or TABULAR, that has the lowest query execution time. Our deep learning algorithm is detailed in Section 4. Once the optimal model is predicted, the rest of the OBDA layers (e.g., query decomposition, entity detection, and operations, e.g., join, limit) follow the optimal virtual data model, GRAPH or TABULAR.
6One-hot vector is a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary.
3.2 Query Decomposition & Relevant Entity Detection
Once the optimal virtual model is selected, our approach decomposes the input SPARQL query into star-shaped queries to identify conjunctive queries [28]. More precisely, in SPARQL, conjunction is expressed using shared variables across sets of triple patterns, also called basic graph patterns (BGPs). Based on this characterization, we divide the query's BGP into a set of sub-BGPs, where each sub-BGP contains all the triple patterns sharing the same subject variable — called a star-shaped query, SSQ (Definition 3). Most approaches for query decomposition in OBDA systems follow the subject-based method because triples sharing the same subject correspond to the same entity, e.g., a table or object in the data source, thus avoiding traversing data to find specific entities to be joined and extra joins that can be very expensive. A minimal sketch of this decomposition is given below.
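To make the decomposition concrete, the following Python sketch groups the triple patterns of a query's BGP by their subject variable to form star-shaped queries; the triple representation and function names are illustrative assumptions, not OPTIMA's actual API.

    from collections import defaultdict

    def decompose_into_ssqs(bgp):
        """Group triple patterns by shared subject into star-shaped queries.

        bgp: list of (subject, predicate, object) triple patterns.
        Returns a dict mapping each subject variable to its SSQ.
        """
        ssqs = defaultdict(list)
        for s, p, o in bgp:
            ssqs[s].append((s, p, o))
        return dict(ssqs)

    bgp = [
        ("?product", "rdfs:label", "?productLabel"),
        ("?product", "bsbm:producer", "?producer"),
        ("?producer", "rdfs:label", "?producerLabel"),
    ]
    # Two SSQs: one for ?product (two triples) and one for ?producer.
    # The link ?product -> ?producer (Definition 4) becomes a join pair.
    print(decompose_into_ssqs(bgp))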
Next, our approach analyzes each star-shaped query and retrieves the semantic mappings that are already predefined, i.e., correspondences between SSQ elements/variables (i.e., ontology classes or properties) and the data sources' entities (e.g., a table) or attributes (e.g., a column name), in addition to the data source type (e.g., relational) [see Definition 2]. A correspondence that maps every triple property of a star-shaped query identifies a relevant entity (Definition 5). Finally, loading those entities defined by the data sources' models into the optimal virtual data model, GRAPH or TABULAR, requires data mapping and transformation, for example, mapping and transforming a table from a relational model into a GRAPH or TABULAR. Moreover, star-shaped SPARQL operations (e.g., projection, filtering, grouping, etc.) are also translated into GRAPH or TABULAR operations.
3.3 Data Mapping and Transformation
Once the relevant entities and sources are identified using semantic mappings as shown above, our approach maps and transforms the relevant entities (e.g., a table) from their original models (e.g., relational) [Definition 1] into data that comply with the predicted optimal virtual data model, GRAPH or TABULAR (Definition 7). This conversion occurs at query time, which allows for the parallel execution of expensive operations, e.g., join (Definition 6).
Fig. 2: Transformation Process — (a) Transforming Relational to GRAPH; (b) Transforming Document-based to TABULAR
Each star-shaped query corresponds to one relevant entity, and thus one single virtual data model is created. This is the case when the relevant entity, according to the mapping, can be retrieved from only one data source, e.g., one relational table. Otherwise, if the relevant entity according to the mapping can be retrieved from multiple sources, then the virtual model for the entity is the union of the temporary virtual models created for each source (Figure 4).
Below we describe how the wrappers transform the data source models into GRAPH and TABULAR.
• For the virtual data model of type GRAPH, the structure of the relevant data returned from the different data sources using existing data access methods [24] is schema-less, e.g., an RDD (Resilient Distributed Dataset). Necessary structural adaptations are then employed, which consist of converting the schema-less data to GRAPH following the mapping process. For the Tabular and Relational models defined by CSV and MySQL, the data is represented as a table with specific columns. The mapping process is then defined as follows (see Figure 2a): for each table row, a vertex is created with the same label as the table's name (e.g., table 'Person' corresponds to all vertices with the label 'Person'), in addition to the root vertex. Edges are created between the vertices and the root vertex, whereas the properties of each vertex are the columns of the table (e.g., column 'name' corresponds to property 'name'), and the values of the properties are the table's cell information. The same process is applied to property graphs defined by Neo4j and to the document-based and Wide-Column models (e.g., an XML file) defined by MongoDB and Cassandra.
• As for the virtual data model of type TABULAR, the structure of the relevant data returned from the different data sources using existing data access methods is organized into named columns, e.g., a DataFrame. Adaptations are needed, which consist of converting the DataFrame to TABULAR following a mapping process. For example, the object selected as a relevant entity of the document-based and wide-columnar models stored in MongoDB and Cassandra is parsed to create a virtual TABULAR (see Figure 2b), which consists of a table with a name similar to the root object's name (e.g., a table 'Person' from the object name 'Person'). A new row is inserted by iterating through the object elements into the corresponding table. The corresponding key-values are saved under the column representing the cell information. The same process is applied to the other models (see the sketch after this list).
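As an illustration of the document-to-TABULAR mapping described above, this minimal Python sketch flattens JSON-like objects into the rows of a table named after the root object; it stands in for the wrapper logic, and the helper name is an assumption.

    def document_to_tabular(root_name, documents):
        """Flatten JSON-like objects into (table_name, header, rows).

        The table is named after the root object, each object becomes a
        row, and each key becomes a column; missing keys become nulls.
        """
        header = sorted({key for doc in documents for key in doc})
        rows = [[doc.get(col) for col in header] for doc in documents]
        return root_name, header, rows

    docs = [
        {"nr": 1, "name": "Alice", "country": "DZ"},
        {"nr": 2, "name": "Bob"},  # 'country' becomes None in the row
    ]
    print(document_to_tabular("Person", docs))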
Fig. 3: Parallel Mechanism for GRAPH and TABULAR — (a) GRAPH Parallel [25]; (b) TABULAR Parallel
We highlight below how SPARQL and star-shaped query operations are translated into Virtual Data Model operations in the cases of GRAPH and TABULAR.
3.4 Distributed Query Processing
Distributed Query Processing is where the virtual model is actually joined and executed. Our approach uses Big Data engines (e.g., SPARK) that offer users the ability
Fig. 4: Union Operation of TEMPORARY Virtual Model — (a) Union of TEMPORARY GRAPHs; (b) Union of TEMPORARY TABULARs
Fig. 5: Join Operation of Virtual Model — (a) Join of GRAPHs; (b) Join of TABULARs
to manipulate the data model of its computation unit (i.e., the virtual data model). This allows the implementation of different data models that can be more suitable for various queries. We consider two types of data models, GRAPH and TABULAR, which allow for graph-parallel (see Figure 3a) and data-parallel (see Figure 3b) computation, thus affecting query performance. Our approach uses several different data models (property graph, document-based, wide-columnar, relational, and tabular) to demonstrate its capability to cover and access various heterogeneous data sources. We should point out that we did not employ any query optimization function to choose the most efficient query execution plan; instead, we focused on the join operation. For example, if our predictive model predicts, based on the input SPARQL query, that the optimal virtual model is of type GRAPH, then for each relevant entity one virtual GRAPH model is generated, following our proposed transformation process (see Subsection 3.3). Once generated, our approach joins those GRAPHs or TABULARs (i.e., one virtual model per relevant entity) into a FINAL Virtual GRAPH or TABULAR (see Figure 5). Below we describe the join process and operations using the GRAPH or TABULAR virtual models.
Joining Virtual Data Model: The joins of data coming from different data sources are actually represented by the connections between star-shaped queries, i.e., two SSQs sta, stb (subject, predicate, object) are connected if the object of sta is the subject of stb. These connections are translated into an array of join pairs (see the green SSQ in Figure 4a).
Fig. 6: Join Algorithms for GRAPH and TABULAR — (a) Multi-Join Algorithm of GRAPHs; (b) Incremental Join Algorithm of TABULARs [12]
As for GRAPH, the FINAL Virtual GRAPH (Figure 5a) is created by iterating through the GRAPH join pairs following a multi-way join algorithm (Figure 6a), which has been proven beneficial in terms of performance in the research literature [29]. The multi-way join algorithm can join two or more relations simultaneously, which is suitable for graph-parallel computation. In practice, new edges are created for each joined pair to link the GRAPHs, such that an edge's source points to one of the GRAPH's vertices and its destination points to the second GRAPH. The FINAL Virtual GRAPH is the result of the newly created edges and the union of the joined pairs' vertices. Finally, we filter out vertices' identifiers that have no destination. Moreover, to make the joining of GRAPHs faster, we selected only the projection columns' IDs before joining GRAPHs, since it is heavy to scan over columns. Similarly, the FINAL Virtual TABULAR, i.e., the joined TABULARs (Figure 5b), is created by applying joins between the respective tables incrementally (Figure 6b), which has been shown to be very efficient [30]. This is done by using a predefined method 'join' that takes the joined pairs' names and the name of the foreign key column as arguments. Moreover, we adopted the same strategy proposed in [24], which applies a filter before data transformation, thus reducing the number of attribute values to be transformed and then joined, which proved highly efficient. A sketch of the incremental TABULAR join is given below.
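To illustrate the incremental join over TABULARs, the following PySpark sketch folds a list of join pairs into one FINAL DataFrame; the toy entities, key columns, and session setup are illustrative assumptions rather than OPTIMA's actual code.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-join-sketch").getOrCreate()

    # Two TABULARs created from relevant entities (toy data).
    product = spark.createDataFrame(
        [(1, "Bar Mix Lemon", 10), (2, "Coke Classic 355 Ml", 20)],
        ["nr", "label", "producer"],
    )
    producer = spark.createDataFrame(
        [(10, "ACME"), (20, "Globex")], ["producer_nr", "producer_label"]
    )

    # Join pairs derived from SSQ connections: (right table, left key, right key).
    join_pairs = [(producer, "producer", "producer_nr")]

    # Incremental join: start from one TABULAR and fold the others in.
    final_tabular = product
    for right, left_key, right_key in join_pairs:
        final_tabular = final_tabular.join(
            right, final_tabular[left_key] == right[right_key], "inner"
        )

    final_tabular.select("label", "producer_label").show()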
• Star-Shaped/SPARQL Operations to GRAPH/TABULAR Operations: GRAPH and TABULAR have different structures; therefore, interaction with GRAPH is possible through Graph Pattern Matching operations (e.g., Cypher-like), while interaction with TABULAR is possible through SQL-like functions. We highlight below how SPARQL and star-shaped operations are translated into Virtual Data Model operations for GRAPH and TABULAR.
– Projection: this operation requires accessing the FINAL Virtual GRAPH or TABULAR. For GRAPH, we used the hash map method to get the properties' indexes by iterating over the projected vertices and collecting the linked vertices into one vertex. This helps reduce the execution time of operations (e.g., limit) by executing them
on a single vertex instead of multiple vertices. In contrast, the FINAL TABULAR is projected using a predefined method 'project' that takes the projection variables as an argument and returns a projected FINAL TABULAR.
– Filtering: performing filtering on a given property of the Virtual GRAPH requires accessing data through an index rather than the property name. Therefore, we used a hash map that stores the property name and index. We get the right property index by matching the property name from the filter with the one from the hash map. As for the Virtual TABULAR model, filters are executed over the TABULAR columns. We use a predefined method 'filter' that takes the filter statement as an argument and returns a filtered virtual TABULAR model.
– Ordering and Limit: to be able to sort or show a limited amount of data from the GRAPH, we extracted triples from the FINAL GRAPH. Next, we used a predefined ordering method, e.g., 'sortBy', and a limit method, 'take', which take the vertex property value as input and output a sorted or limited FINAL GRAPH. As for the TABULAR model, it can be sorted and limited using the predefined methods 'orderBy' and 'limit', respectively. These methods take the ordering column, or the number of needed rows in the case of Limit, as an argument and return an ordered or limited FINAL TABULAR. The sketch below illustrates these translations on the TABULAR side.
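As a compact illustration, the PySpark sketch below chains the 'filter', projection, 'orderBy', and 'limit' calls named above; the data and column names are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("tabular-ops-sketch").getOrCreate()

    offers = spark.createDataFrame(
        [(1, 3500, 4), (2, 7800, 2), (3, 1200, 9)],
        ["nr", "price", "delivery_days"],
    )

    # SPARQL FILTER / projection / ORDER BY / LIMIT as TABULAR operations.
    result = (
        offers.filter(col("price") < 8000)   # FILTER (?price < 8000)
              .select("nr", "price")         # projection variables
              .orderBy(col("price").asc())   # ORDER BY ?price
              .limit(2)                      # LIMIT 2
    )
    result.show()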
3.4.1 Query Execution
Optimizing query execution time is a crucial step when it comes to loading and joining data. However, time optimization depends not only on the virtual data model, i.e., GRAPH or TABULAR, but also on the execution plan of operations, e.g., applying a filter before joining data. We disabled any query optimization by the Apache SPARK and Graphx engines to emphasize the join operation when querying multiple data sources.
Optimization Strategy for GRAPH. To join GRAPHs, we applied a multi-way join algorithm (Figure 6a), which has been proven beneficial in terms of performance in the research literature [29]. The multi-way join algorithm can join two or more relations at the same time, which is suitable for graph-parallel computation. Moreover, to make the join of GRAPHs faster, we selected only the projection columns and their IDs before joining GRAPHs, since it is heavy to scan over columns (unlike the TABULAR strategy given next).
Optimization Strategies for TABULAR. To join TABULARs, research has shown that incremental data processing approaches [30] for data-parallel computation achieve better performance, since they rely on updating the results of a query when updates are streamed rather than re-computing these queries, and they may require less memory than batch processing. Therefore, we followed the incremental join: if TABULAR is selected as the optimal virtual data model based on query behavior, the FINAL Virtual TABULAR is created by iterating through the TABULARs that are created from the relevant entities and joining them incrementally (see Figure 6b). Moreover, we adopted the same strategy as described in [24], where we applied a filter before data transformation, thus reducing the number of attribute values to be transformed and then joined, which proved very efficient.
4. Deep Learning Model
This section describes our deep learning model to predict the virtual data model of
type GRAPH or TABULAR.
4.1 SPARQL Features Analysis
Our model breaks down the SPARQL query plan into nodes (Figure 7a). Each node includes a set of query features that significantly affect the query cost (e.g., filter). The different features are then encoded using different encoding models. Below, we list those features and their encoding:
• MetaData: the set of attributes and entities used in the SPARQL query (e.g., the entity name 'producer'). We encode both attributes and entities using a one-hot vector. Then we concatenate each attribute vector with its entity vectors to obtain a final MetaData vector.
• Operation: the set of physical operations used in the SPARQL query, such as Join, BGP, Projection, OrderBy, and Limit. Each operation is composed of an operator (e.g., ">=") and a list of operands (entities or attributes, e.g., [operator='project', attributes='price, delivery-days']). Both the operator and its operands are encoded using a one-hot vector. Finally, each operation vector in the SPARQL query is the concatenation of an operator vector and its operand vectors.
• Filter: the set of query filters. A filter is considered a special operation since it can be either atomic or compound. Each atomic filter is composed of an attribute, an operator, and an operand. The filter operand can be either a float or a string value. Both the attribute and the operator are encoded using a one-hot vector. To encode the operand, we use a normalized float if its value is numeric; otherwise, we use a String representation. The String representation makes use of a Char Embedding model and a CNN (Convolutional Neural Network [31]) to obtain a fixed-length dense String vector. The three resulting vectors are concatenated to form one single filter vector.
A compound filter is a combination of multiple atomic filters using either the AND or OR operator; for example, 'price > 4000 (atomic) AND price < 20000 (atomic)' is considered compound. To obtain the vector of a compound filter, we encode each logical operator and atomic filter using one-hot encoding. Next, a filter tree is created where the root is the one-hot vector of a logical operator (e.g., AND), and the nodes are the one-hot vectors of the atomic filters (e.g., the left node 000111 representing price > 4000). Finally, each node (one-hot vector) is transformed into a sequence using the Depth-First Search (DFS) algorithm. At the end of each sequence, we add an empty node. The sequences are then concatenated following the visited order, as sketched below.
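The sketch below mirrors this encoding on a toy vocabulary: atomic filters become one-hot-based vectors, a compound filter becomes a small tree, and a DFS flattens the tree into the final sequence. The vocabulary sizes and node layout are illustrative assumptions.

    import numpy as np

    ATTRS = ["price", "delivery_days"]      # toy attribute vocabulary
    OPS = [">", "<", ">=", "<=", "="]       # toy operator vocabulary
    LOGIC = ["AND", "OR"]

    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    def encode_atomic(attr, op, value, value_max=20000.0):
        """Concatenate one-hot attribute, one-hot operator, normalized float."""
        return np.concatenate([
            one_hot(ATTRS.index(attr), len(ATTRS)),
            one_hot(OPS.index(op), len(OPS)),
            [value / value_max],
        ])

    # Compound filter: price > 4000 AND price < 20000, as (root, children).
    tree = (one_hot(LOGIC.index("AND"), len(LOGIC)),
            [encode_atomic("price", ">", 4000),
             encode_atomic("price", "<", 20000)])

    def dfs_flatten(node):
        """DFS over the filter tree; pad nodes to one width, then concatenate."""
        root, children = node
        width = max(len(root), max(len(c) for c in children))
        pad = lambda v: np.pad(v, (0, width - len(v)))
        parts = [pad(root)] + [pad(c) for c in children]
        parts.append(np.zeros(width))       # empty node ends the sequence
        return np.concatenate(parts)

    print(dfs_flatten(tree).shape)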
4.2 Proposed Tree-structured Model
Tree-structured models have been proven more powerful than neural networks at predictive tasks over tabular data [32]. Inspired by the work presented in [33], we propose a deep learning model (Figure 7b) that takes as input the encoded features of the SPARQL
Fig. 7: Deep Learning: Feature Extraction and Tree-Structured Model — (a) Features Extractor; (b) Tree-Structured Model
query and outputs the optimal virtual data model, GRAPH or TABULAR, that has the lowest cost. Our model consists of an embedding layer to condense the feature vectors and an estimation layer to estimate the optimal virtual data model. Additionally, the model includes an intermediate representation layer to capture the correlation between the joined star-shaped queries. In the following, we give a detailed explanation of the model architecture.
4.2.1 SPARQL Embedding Layer
The SPARQL Query Embedding Layer (Figure 8a) embeds a sparse vector into a dense vector. It takes as inputs three types of feature vectors: MetaData, Operation, and Filter. First, the MetaData vector, along with the Operation vector, is embedded using a fully connected neural network layer7 with a ReLU (Rectified Linear Unit) activator, which is a piecewise linear function that outputs the input directly if it is positive. The structure of the Filter vector is more complicated in the case of compound filters. Therefore, we adapted a Min-Max-Pooling operation to embed the Filter vector.
The Min-Max-Pooling model is a tree-structured model that takes the structure of the Filter tree. For leaf nodes, we use a fully connected neural network. For conjunction nodes, we use a max pooling layer for the 'OR' operator and a min pooling layer for the 'AND' operator. The max pooling layer corresponds to the maximum number of estimated results satisfying the atomic predicates, while the min pooling layer corresponds to the minimum number of estimated results satisfying the atomic predicates, thus representing the SPARQL query filters explicitly. A sketch of this embedding is given below.
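A minimal PyTorch sketch of this Min-Max-Pooling embedding, assuming the toy filter-vector width from the previous sketch; the layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class FilterEmbedding(nn.Module):
        """Embed a filter tree: FC on leaf vectors, min/max pooling on AND/OR."""

        def __init__(self, in_dim=8, hidden=16):
            super().__init__()
            self.leaf = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

        def forward(self, leaf_vectors, logic_op):
            # leaf_vectors: (num_atomic_filters, in_dim)
            h = self.leaf(leaf_vectors)          # embed each atomic filter
            if logic_op == "AND":
                return h.min(dim=0).values       # min pooling for AND
            return h.max(dim=0).values           # max pooling for OR

    emb = FilterEmbedding()
    atomic = torch.randn(2, 8)        # e.g., price > 4000 and price < 20000
    print(emb(atomic, "AND").shape)   # torch.Size([16])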
4.2.2 SPARQL Representation Layer
Learning representations of complex structures such as trees using classic neural networks poses several challenges. First, the neural networks can learn much information from the leaf nodes but fail at capturing the correlation among upper nodes. This is known as the vanishing gradient problem. Second, capturing correlations between multiple tree nodes requires storing many intermediate results, which makes the space grow exponentially. This is known as the gradient explosion problem. To handle these
7Fully connected network: a linear combination a0x0 + . . . + anxn; two connected linear layers means the output of the first one is the input of the second one.
two problems, we designed an intermediate layer (detailed in Figure 8b) that captures the global cost information from the leaf nodes to the root by training representations for nodes recursively. We use fully connected networks that have the same structure and share common parameters. Each layer has three inputs: an embedding vector, a representation vector of the right child, and a representation vector of the left child. We used Long Short-Term Memory (LSTM) [34] as the recurrent model. The LSTM model uses the concept of 'memory' to store information from previous nodes, which makes it capable of learning order dependence in the tree structure. This helps prevent the information loss problem. On the other hand, the Sigmoid forget gate helps the LSTM address the space explosion problem. A sketch of one recursive representation step is given below.
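The following PyTorch sketch shows one recursive step of such a representation layer: an LSTM cell combines a node's embedding with the concatenated representations of its two children. All dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class NodeRepresentation(nn.Module):
        """Recursive node representation: embedding + left/right child states."""

        def __init__(self, emb_dim=16, rep_dim=32):
            super().__init__()
            # Input = node embedding concatenated with both child representations.
            self.cell = nn.LSTMCell(emb_dim + 2 * rep_dim, rep_dim)

        def forward(self, embedding, left_rep, right_rep, state):
            x = torch.cat([embedding, left_rep, right_rep], dim=-1)
            h, c = self.cell(x, state)   # forget gate bounds stored information
            return h, c

    rep = NodeRepresentation()
    h0 = c0 = torch.zeros(1, 32)
    leaf = torch.zeros(1, 32)            # children of a leaf are zero vectors
    emb = torch.randn(1, 16)
    h, c = rep(emb, leaf, leaf, (h0, c0))  # applied bottom-up at every node
    print(h.shape)                         # torch.Size([1, 32])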
Fig. 8: Deep Learning: Tree and Representation Model — (a) Tree Representation; (b) Representation Model
4.2.3 Virtual Model Classification Layer
It is a binary classification model that takes the representation vector of the query tree nodes as input and outputs the optimal virtual data model, GRAPH or TABULAR, with the lowest cost (i.e., we label GRAPH with value 1 for SPARQL queries that are faster than TABULAR and label TABULAR with value 0 for SPARQL queries that are faster than GRAPH). The classification layer includes two fully connected neural networks with a ReLU activator. The output layer is a Sigmoid function that returns a number from 0.0 to 1.0, representing the probability that the input belongs to the GRAPH class. If the output is closer to 1.0, then the predicted virtual data model is of type GRAPH; otherwise, if the output is closer to 0.0, then the predicted virtual data model is of type TABULAR. A sketch of this head is given below.
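A minimal PyTorch sketch of this classification head — two fully connected layers with ReLU followed by a Sigmoid output. The 16-neuron hidden layers match Section 5.1; the 32-dimensional input is an assumption carried over from the previous sketch.

    import torch
    import torch.nn as nn

    classifier = nn.Sequential(
        nn.Linear(32, 16), nn.ReLU(),    # first fully connected layer
        nn.Linear(16, 16), nn.ReLU(),    # second fully connected layer
        nn.Linear(16, 1), nn.Sigmoid(),  # probability that GRAPH is optimal
    )

    root_representation = torch.randn(1, 32)   # from the representation layer
    p_graph = classifier(root_representation)
    model = "GRAPH" if p_graph.item() >= 0.5 else "TABULAR"
    print(p_graph.item(), model)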
Entity          Product    Offer    Review   Person   Producer
Database type   Cassandra  MongoDB  Neo4j    CSV      MySQL
# of tuples     50000      50000    50000    50000    50000
Data size       ~4MB       ~90MB    ~3MB     70MB     14MB

Table 1: Data & Queries Characteristics
Table 2: Tables and Operations involved in Queries — for each query Q1-Q20, the entities accessed (Product, Offer, Review, Person, Producer) and the operations used (PROJECT, FILTER, ORDERBY, LIMIT, DISTINCT), with the number of projections and filters per query.
5. Implementation and Experimental Setup

OPTIMA, an implementation of our approach, is an OBDA system that calls Graphx and Apache-Spark8 to implement two virtual data models, GRAPH and TABULAR. The virtual data model is the model defined by the computation unit of these two query engines9. Graphx and Apache-Spark already implement wrappers, called connectors, of which we used five types to load data stored in Neo4j (property graph), MongoDB (document-based), Cassandra (wide-column), MySQL (relational), and CSV (tabular). As for the transformation, we used Graphx and Apache-Spark functions10, e.g., flatMap(x=>y). OPTIMA calls a deep learning model to get the predicted optimal virtual data model; it uses NumPy for encoding data and PyTorch for the prediction model. OPTIMA is available on GitHub at https://github.com/chahrazedbb/OPTIMA.
We conducted an empirical study to evaluate OPTIMA's performance with respect to the following sub-research questions of our problem. RQ1: What is the query performance using OPTIMA? RQ2: Is the time of prediction plus the time of query execution using the optimal virtual model equal to that of the fixed one? RQ3: What is the query performance when using TABULAR versus GRAPH? RQ4: What is the accuracy of OPTIMA's machine learning? RQ5: What is the query performance of OPTIMA compared to the state of the art, e.g., Squerall [12]? RQ6: What is the impact of involving more data sources in a join query? RQ7: What is the resource consumption (CPU, memory) of OPTIMA while running various queries? RQ8: What is the time taken by each transformation process?
5.1 Benchmark, Queries, and Environment
There is no benchmark dedicated to assessing ontology-based big data access systems. We therefore use BSBM* [12] to evaluate the performance of OPTIMA. BSBM* is an adapted version of the BSBM benchmark [35] in which five tables, Product, Offer, Review, Person, and Producer, are distributed among different data stores. To test OPTIMA, we use the five tables to enable up to 4-chain joins. These tables are loaded into five different data sources: Neo4j, MongoDB, Cassandra, MySQL, and CSV. Table 1 shows the characteristics of the data. We generated 5150 queries with 0-4 joins, 0-45 selections, and 0-16 filter, limit, and orderBy operations. The characteristics of these queries
8for Apache-Spark, a small part of OPTIMA is based on Squerall’s code (https://github.com/EIS-Bonn/Squerall)
9RDD is an immutable distributed collection of elements, while DataFrame is an immutable distributed collection of data
organized into named columns. RDD is distinct from DataFrame in that the former is considered schema-less.
10https://spark.apache.org/docs/latest/graphx-programming-guide.html, https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
(a) Query Result Returned by OPTIMA & Squerall

Query:
SELECT DISTINCT ?productLabel ?producerLabel
WHERE { ?product rdfs:label ?productLabel .
        ?producer rdfs:label ?producerLabel .
        ?product rdf:type bsbm:Product .
        ?product bsbm:producer ?producer . }

OPTIMA:   ['Bar Mix Lemon', 'Coke Classic 355 Ml']
Squerall: ['Bar Mix Lemon', 'Coke Classic 355 Ml']

(b) Avg Time
System    Time (ms)
OPTIMA    2400
Squerall  4200

(c) Resource Consumption
Metrics           OPTIMA  Squerall
CPU average (%)   0.21    0.20
Max memory (GB)   1.0     0.97

Table 3: OPTIMA Performance
are presented in Table 2. We take 4120 queries for training the model and 1030 queries for validation. We run the evaluation on Ubuntu 20.04 64-bit with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, allocating 8GB of RAM.
Training paradigm
In diesem Abschnitt, we provide a detailed description of the training paradigm of our deep
learning model. The training data typically involves the following steps:
• Data collection and preprocessing: To the best of our knowledge, no large datasets of
SPARQL queries exist. daher, we generated more than 5000 SPARQL queries that
combine all possible elements of a SPARQL query, as described in Table 1. Diese
queries are then preprocessed (see an example of SPARQL query in Appendix A.1)
to extract features (see Appendix A.2) and then convert them into a tree-structured
representation (see Appendix A.3) suitable for input into our deep-learning model. Wir
run each query on both GRAPH and TABULAR. We set GRAPH with a value of 1 für
SPARQL queries that are faster than TABULAR and label TABULAR with a value of
0 for SPARQL queries that are faster than GRAPH.
• Tree construction: The tree structure is constructed based on the query plan, in other words, split into the query result clause and the query pattern. For example, the tree's root node represents the query plan, the child nodes represent the query result clause and the query pattern, and the leaf nodes of the query result clause represent the clause type, such as the "SELECT" operation (see Appendix A.3).
• Supervised learning: To enable the model to learn the relationships between the SPARQL query elements (e.g., plan, operators, etc.) and the execution time under each data model, GRAPH or TABULAR, we trained our deep learning model as a feed-forward neural network with multiple hidden layers and the non-linear activation functions ReLU and Sigmoid, including two fully connected neural networks, each with 16 neurons. We trained the model on 80% of the queries using the mean squared error as the loss function and the Adam optimization algorithm. The model is trained for 100 epochs, and the validation loss is monitored to prevent overfitting.
• Model evaluation: We evaluated the trained model's accuracy and obtained good results after several iterations (see Section 5.5).
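The labeling step described above can be summarized by the small sketch below: each generated query is timed against both virtual models and labeled with the faster one. The timing helpers are illustrative assumptions.

    import time

    def label_queries(queries, run_on_graph, run_on_tabular):
        """Label each query 1 if GRAPH is faster, 0 if TABULAR is faster.

        run_on_graph / run_on_tabular: callables executing the query against
        the corresponding virtual data model (assumed provided by the engine).
        """
        labeled = []
        for q in queries:
            t0 = time.perf_counter(); run_on_graph(q)
            graph_time = time.perf_counter() - t0
            t0 = time.perf_counter(); run_on_tabular(q)
            tabular_time = time.perf_counter() - t0
            labeled.append((q, 1 if graph_time < tabular_time else 0))
        return labeled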
Query  OPTIMA  Squerall  Time Difference
Q1     1291    4098      2807
Q2     1254    2519      1265
Q3     730     3091      2361
Q4     10299   10283     16
Q5     10199   10191     8
Q6     1553    7984      6431
Q7     7104    7089      15
Q8     8442    8427      15
Q9     10094   10088     6
Q10    4694    4684      10
Q11    2575    2561      14
Q12    233     1400      1167
Q13    4673    4644      29
Q14    4487    4469      18
Q15    2397    3885      1488
Q16    2881    2875      6
Q17    1698    3314      1616
Q18    4607    8742      4135
Q19    2804    9059      6255
Q20    5648    7407      1759

Table 4: Time in ms per Query of OPTIMA & Squerall
5.2 Metrics
To evaluate OPTIMA, we use the following metrics:
• OPTIMA Accuracy. We compare the results returned by OPTIMA against the results
returned by Squerall.
• Classification. We use two metrics to evaluate the OPTIMA classification model: the cross-entropy loss and an accuracy function. Assuming the real results are denoted r = (ri), the predicted results p = (pi), and the correctly predicted results tp = (tpi), where 1 ≤ i ≤ N, we compute these metrics as follows (see the sketch after this list): CE(r, p) = −∑i ri log(pi), Acc(tp, p) = ∑i tpi / ∑i pi.
• Memory and CPU consumption as described in [36]. Specifically, we measure how
much the memory and CPU are active during the computation.
• Execution Time. We measure the time OPTIMA takes from query submission to the
delivery of the answer. The time is measured using the absolute wall-clock system time
reported by the Scala time() function.
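For concreteness, a small sketch computing the two classification metrics defined above with NumPy; the toy vectors are illustrative.

    import numpy as np

    def cross_entropy(r, probs, eps=1e-12):
        """CE(r, p) = -sum_i r_i * log(p_i), with p_i a predicted probability."""
        return -np.sum(r * np.log(probs + eps))

    def accuracy(tp, p):
        """Acc(tp, p) = sum_i tp_i / sum_i p_i (correct over predicted)."""
        return np.sum(tp) / np.sum(p)

    r = np.array([1, 0, 1, 1])              # real labels (GRAPH=1, TABULAR=0)
    probs = np.array([0.9, 0.2, 0.8, 0.4])  # model output probabilities
    p = (probs >= 0.5).astype(int)          # predicted results
    tp = p * r                              # correctly predicted results
    print(cross_entropy(r, probs), accuracy(tp, p))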
5.3 Method
We consider two studies:
• In the first study, we compare OPTIMA's results with SPARK-based Squerall's results. Our comprehensive literature review did not reveal any work other than Squerall that is available and supports most data sources. Squerall uses two big data engines, Presto and SPARK: Presto-based, where the virtual model of the Presto engine (which cannot be controlled by users) is used for query processing, and SPARK-based, where DataFrames are created as the virtual data model. To make the results comparable, we choose SPARK-based Squerall and extend it to support Neo4j. We assess OPTIMA in terms of (1) results (accuracy), (2) time, and (3) CPU and memory usage compared to SPARK-based Squerall. We should note that comparing the overall execution time of OPTIMA against an original system, e.g., relational, for a given query is impossible because we are querying various heterogeneous formats and models.
• In the second study, we inspect OPTIMA's main components: machine learning, data wrappers, and query execution. We observe the behavior of query execution for GRAPH and TABULAR in terms of time. For the data wrappers, we investigate the time taken by the transformation process from the data sources to GRAPH or TABULAR. As for the machine learning component, we compare our model with an LSTM model in terms of accuracy and time. The LSTM model takes as input the encoded feature vectors without any correlation and outputs the data model.
5.4 Experiment 1: OPTIMA vs SPARK-based Squerall
In this experiment, we load BSBM* as described above to obtain the results from
OPTIMA and SPARK-based Squerall. Then, we run 5150 SPARQL queries and compare
the results.
• Validation of Results and Overall Execution Time: this comparison allows us to confirm the correctness of the results returned by OPTIMA. Table 3a shows the results of OPTIMA and SPARK-based Squerall for a complex SPARQL query, Q21. The results are the same for both systems, which confirms that OPTIMA is able to support and join large data coming from different datasets.
Table 4 illustrates the execution time returned by both systems. As can be observed, OPTIMA outperforms Squerall for queries that involve multiple joins. The time difference ranges from 0 to 80000 milliseconds (ms). This difference is due to the predicted virtual data model, e.g., Q19, Q20, for which deep learning predicted that the virtual model of type GRAPH is optimal. We also observe a small difference in execution time (ranging from 0 to 30 ms) in favor of Squerall compared to OPTIMA for queries that involve multiple projections, e.g., Q7, Q10. This is explained by the fact that the optimal virtual model is identical to Squerall's, and both Squerall and OPTIMA use the same APIs to call data (wrappers); however, the data model prediction time added to OPTIMA makes it slightly slower than Squerall. Furthermore, the average execution time of Squerall is greater than 4000 ms, compared to an average execution time of 2400 ms for OPTIMA, as shown in Table 3b. These results illustrate the benefits of OPTIMA over existing systems; thus, RQ1 and RQ5 are answered.
• Data Model Execution Time. As shown in Table 5, the analysis of the experimental results indicates that GRAPH is faster than TABULAR in most cases, except for queries like Q8 and Q10; it has comparable to slightly lower performance in Q16. This confirms that the optimal model is very important in reducing the execution time of queries. The total execution time ranges from 50 to 90000 ms, with 90% of all cases being at or below about 3000 ms. The OPTIMA virtual data model of type GRAPH is faster in queries that involve joins (ranging from 50 to 40000 ms), while the TABULAR model outperforms the GRAPH model in queries involving more projections (ranging from 200 to 90000 ms).
This is explained by the fact that GRAPH is designed to store connections between data. Therefore, queries do not scan the entire graph to find the nodes that meet the search criteria; they look only at nodes that are directly connected to other nodes, while the SQL-like methods used by the TABULAR model require expensive join operations because they traverse all data to find the data that meets the search criteria. On the other hand, the TABULAR model is faster when handling projections because the data structure is already known, and data can be easily accessed by column names. Conversely, the GRAPH model does not have a predefined structure for the data, and each node attribute has to be examined individually during a projection query.
The number of joins has a decisive impact on query performance; it should be taken
into consideration with other factors, e.g., size of involved data, presence of filters, and
selected variables. For example, Q2 joins only two data sources, Product and Review
(1254 ms) but has comparable performance to Q1 (1291 ms), which joins four entities (Product, Offer, Review, and Producer). This may be due to the filtering in Q1 (16 filters), which significantly reduces the intermediate results to join. Q3 involves four data sources, yet it is among the fastest queries. This is because it involves the small entities Person and Producer, which again reduces the intermediate results to join. With five data sources to join, Q4 is among the most expensive queries (10299 ms). This can be attributed to the fact that the filter on Product is selective (?language = "en"), which results in large intermediate results to join, in contrast to Q6 (?price < 8000). Although the four-source joins Q7 and Q8 involve the small entity Producer, they are the most expensive queries executed over the GRAPH model; this can be attributed to the large number of projections (38 attributes). Thus, we answer RQ3 and RQ6 and suggest that operations can affect query execution time.

Query  Prediction Time  GRAPH  TABULAR
Q1     3                1143   4098
Q2     3                1161   2519
Q3     4                1239   3091
Q4     6                1243   10283
Q5     4                306    10191
Q6     5                3181   7984
Q7     5                7168   7089
Q8     6                12237  8427
Q9     2                4977   10088
Q10    4                16681  4684
Q11    5                1211   2561
Q12    1                3567   1400
Q13    5                482    4644
Q14    5                1285   4469
Q15    4                766    3885
Q16    3                2883   2875
Q17    2                6639   3314
Q18    4                1366   8742
Q19    4                3370   9059
Q20    4                1723   7407

Table 5: Time in ms per Query of Prediction, GRAPH & TABULAR
• Resource Consumption: finally, we record the resource consumption (i.e., memory and CPU) of OPTIMA and SPARK-based Squerall. The results reported in Table 3c show that the CPU is not fully used by either OPTIMA or SPARK-based Squerall (around 0.21% was used). This means that the complexity of queries does not impact CPU consumption. As for the total memory reserved, OPTIMA consumed around 1GB out of 8GB per node, while SPARK-based Squerall used at most 1GB. Having the same CPU and memory consumption can be explained by the fact that both use the same query engine, SPARK, and the same distribution of CPU between the nodes for loading and transformation. This answers RQ7.
5.5 Experiment 2: Performance of OPTIMA’s Predictive Model
In this study, we evaluate the main components of OPTIMA.
• Deep Learning Accuracy. We evaluated our model against LSTM and Regression models to assess our encoding techniques and prediction model. We used 5150 queries, 80% for training and 20% for validation, trained all models on the same dataset, and computed the accuracy and the cross-entropy loss. Results in Table 6a show that our tree-structure-based method outperforms the LSTM and Regression models, with an average accuracy of 0.831 for our model against 0.708 and 0.717 for LSTM and Regression, respectively. The cross-entropy loss is 0.00018 for our model, compared to 1.92027 and 6.51098 for LSTM and Regression, respectively. This is explained by the fact that both LSTM and Regression rely on an independence assumption among the different operations and attributes, whereas our model captures more of their correlations and therefore achieves the best performance. This answers RQ4.
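For reference, the reported loss corresponds to the standard multi-class cross-entropy (assuming the usual formulation, which the paper does not spell out):

\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c \in \{\mathrm{GRAPH},\,\mathrm{TABULAR}\}} y_{i,c}\,\log \hat{y}_{i,c}

where y_{i,c} is 1 if model c is optimal for query i and 0 otherwise, and \hat{y}_{i,c} is the predicted probability; a near-zero loss such as 0.00018 indicates that the predicted probabilities concentrate sharply on the correct model.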
• Deep learning reduces the overall execution time.
Cost      LSTM     Regression  Our Model
Loss      1.92027  6.51098     0.00018
Accuracy  0.708    0.717       0.831
(a) Loss & Accuracy of Deep Learning Models

Condition       Machine Learning  Only GRAPH  Only TABULAR
Avg. time (ms)  12                1320        2862
(b) Time of Deep Learning, GRAPH & TABULAR

Table 6: Deep Learning Performance
To check whether deep learning reduces the overall execution time of OPTIMA by selecting the optimal virtual data model, we first report the time taken by OPTIMA's components: the machine learning algorithm, query execution over the GRAPH model, and query execution over the TABULAR model, compared against SPARK-based Squerall. We ran OPTIMA and Squerall over 1030 queries; the results are shown in Table 6b. The average execution time of the machine learning component is only 12 ms, while the average time is 1320 ms for GRAPH and 2862 ms for TABULAR. The results show that for most queries, GRAPH is faster than TABULAR even when the prediction time is included. In summary, only 14% of the queries that were initially faster for OPTIMA (using GRAPH as the virtual model) than for Squerall shifted in the latter's favor; this is explained by the fact that, for those queries, the difference in execution time between GRAPH and Squerall was slight to begin with. This answers RQ2.
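As a back-of-the-envelope check of these averages (a sketch using the Table 6b figures; variable names are illustrative):

object PredictionOverheadSketch {
  def main(args: Array[String]): Unit = {
    // Average figures from Table 6b; prediction pays off whenever the cost gap
    // between the two virtual models exceeds the 12 ms predictor latency.
    val predictionMs = 12.0
    val graphMs      = 1320.0
    val tabularMs    = 2862.0
    val optimaAvg    = predictionMs + math.min(graphMs, tabularMs) // 1332 ms
    assert(optimaAvg < tabularMs) // prediction + GRAPH beats always-TABULAR
    println(s"avg. saving: ${tabularMs - optimaAvg} ms per query") // 1530 ms
  }
}

On average, prediction plus GRAPH execution (1332 ms) costs less than half of always using TABULAR (2862 ms).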
Model    Neo4j  JDBC  CSV  Cassandra  MongoDB  Loading
GRAPH    138    954   196  7695       188      4.327
TABULAR  3275   199   255  5319       330      7.141
Table 7: Time (ms) of Data transformation to GRAPH & TABULAR
5.5.1 Data Wrapper Time
To answer RQ8, we evaluate in this study the time needed to load the data from the data sources into the virtual data model of type GRAPH or TABULAR (see Table 7). Since the transformation processes differ, we expect different behavior from the wrappers. We report below the time needed by each wrapper, with the following observations (a minimal loading sketch in Scala follows the list):
• The Neo4j connector loads 50000 nodes from Neo4j into GRAPH within 138 ms, compared to 3275 ms into TABULAR. This is explained by the fact that the property-graph model used by Neo4j has the same exact structure as the GRAPH model.
• The CSV connector loads 50000 rows from CSV files into GRAPH within 196 ms, compared to 255 ms into TABULAR, even though CSV files store data as tables. This can be explained by the fact that the GRAPH virtual model is schema-less and loads data directly, with no need to preserve the data structure, while TABULAR takes time to build the data schema.
• The JDBC connector loads 50000 rows from a MySQL database into GRAPH within 954 ms, compared to 199 ms into TABULAR. This can be explained by the fact that MySQL uses a relational model, which has the same data structure as the virtual TABULAR model.
• The MongoDB connector loads 50000 rows from MongoDB into GRAPH within 188 ms, compared to 330 ms into TABULAR. This can be explained by the fact that MongoDB is document-based, i.e., schema-less like the GRAPH virtual model, unlike the TABULAR model, which needs to build a data schema.
• The Cassandra connector loads 50000 rows into GRAPH within 7695 ms, compared to 5319 ms into TABULAR. This can be explained by the fact that Cassandra uses a columnar data model, which is closer to the TABULAR model even though Cassandra is a NoSQL database.
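The two loading paths timed above can be sketched as follows (a minimal Scala sketch under assumed shapes, not OPTIMA's actual wrapper code; the CSV source and file path are hypothetical):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object WrapperSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wrapper-sketch").getOrCreate()

    // TABULAR path: reading CSV into a DataFrame builds a schema up front,
    // which is the extra cost the CSV observation above attributes to TABULAR.
    val products = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/product.csv") // hypothetical path

    // GRAPH path: rows become vertices keyed by a synthetic id; no schema is
    // preserved, matching the schema-less GRAPH virtual model.
    val vertices     = products.rdd.zipWithUniqueId().map { case (r, id) => (id, r) }
    val edges        = spark.sparkContext.emptyRDD[Edge[String]] // joined later
    val productGraph = Graph(vertices, edges)

    println(s"loaded ${productGraph.vertices.count()} vertices")
    spark.stop()
  }
}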
6. Related Work
Our literature review reveals two categories of approaches to data virtualization, namely "ontology-based data access" and "non-ontology-based data access" [12]. Non-ontology-based data access approaches mostly use SQL-like query languages and implement a virtual relational model [37,38], defining views of relevant data over sources that have a relational model. These views are generated based on mapping assertions that associate the general relational schema with the data source schemata. The shortcoming of these approaches is that schema modifications and extensions are very rigid due to the mappings and may depend on complex constraints. Furthermore, these approaches use Self-Contained Queries [24], where users cannot control the structure of the virtual data model. OBDA approaches [39] use SPARQL as a unified access language and detect the relevant data from the sources to be joined through ontologies and standardized mappings. This provides flexibility in modifying and extending the ontology and mappings to cope with the semantic differences found across the data schemata.
Existing systems implement OBDA over relational databases, e.g., Ontop [40] and Stardog (http://www.stardog.com), which use virtual knowledge graphs. These solutions are not designed to query large-scale data sources, e.g., NoSQL stores or HDFS. Our study's scope therefore focuses on works that query large-scale data sources using OBDA. Optique [10] is an OBDA platform that accesses both static and streaming data. It implements a relational model (implicitly TABULAR) as its virtual model while querying data sources such as SQL databases and other sources, e.g., CSV and XML. There is no clear description of how Optique accesses NoSQL stores and distributed file systems (e.g., HDFS). Ontario [11] focuses on query rewriting, planning, and federation, with a strong emphasis on RDF data as input. Its query plans are built and optimized based on a set of heuristics. The virtual model used by Ontario is the GRAPH model (explicitly, RDF).
Squerall [12], a recent work close to OPTIMA, leverages the Big Data engines SPARK and Presto to query heterogeneous large data sources on-the-fly. The virtual data model imposed by Presto is TABULAR, and users are not offered control over it; and although SPARK can offer control over the virtual data model, Squerall uses DataFrames, i.e., a TABULAR virtual model. However, the choice of virtual data model in all these systems is based on ease of use and flexibility rather than on solid evidence that it improves query processing. No existing work (1) implements the different virtual models and (2) selects the optimal one based on query behavior. Regarding machine learning, some works [14,15,16] addressed the cost estimation of SPARQL queries to
optimize the query execution plan, e.g., via performance prediction; however, all these approaches are designed for a single query over one single data source.
7. Conclusion
We presented a new approach that reduces the execution time of querying large heterogeneous data by predicting the optimal virtual data model based on query behavior. OPTIMA, a realization of our approach, implements two virtual models, GRAPH and TABULAR, within the query engine SPARK (GraphX and Apache Spark). The deep learning model built on top of OPTIMA's architecture estimates the cost of the query against both virtual models to select the optimal one for the given query: it extracts significant features such as the query plan and query operations and returns the optimal virtual data model. Once the model is selected, OPTIMA builds a unified view of the data from multiple data sources on-the-fly by decomposing the input SPARQL query into star-shaped queries. Next, it uses ontology-based mappings to detect the relevant entities in the original sources. The relevant entities are then loaded by the wrappers into the predicted virtual model, GRAPH or TABULAR, to be joined. In the case of GRAPH, sets of vertices and edges are joined, while for TABULAR, a set of tables is combined. Finally, the results are returned by applying the remaining operations on the final joined GRAPH or TABULAR. Extensive experiments showed a reduction in query execution time of over 40% for the TABULAR model and over 30% for the GRAPH model.
A. Appendix: Training Data
In this appendix, we present an example of the data collection and preprocessing for the training model.
SELECT DISTINCT ?vendor ?country ?producerPublisher
WHERE {
  ?producer edm:country ?country .
  ?producer bsbm:publisher ?producerPublisher .
  ?offer bsbm:producer ?producer .
  ?offer bsbm:vendor ?vendor .
  ?offer rdf:type schema:Offer .
  FILTER ( ?country = "DE" ) .
  FILTER ( ?producerPublisher = "1" ) .
  FILTER ( ?vendor >= 50 ) .
}
Listing A.1: SPARQL query
A. [DISTINCT, B, H, I, J, K]
B. [PROJECT, ?vendor, ?country, ?producerPublisher]
C. [BGP, D, E]
D. [TRIPLE, ?producer, ?country]
E. [TRIPLE, ?producer, ?producerPublisher]
F. [BGP, G]
G. [TRIPLE, ?offer, ?vendor]
H. [JOIN, C, F]
I. [FILTER, =, ?country, "DE"]
J. [FILTER, =, ?producerPublisher, "1"]
K. [FILTER, >=, ?vendor, 50]
Listing A.2: Feature extraction from SPARQL query A.1
{
  "Plan":
  {
    "DISTINCT": ""
    , "PROJECT":
    [
      "?vendor"
      , "?country"
      , "?producerPublisher"
    ]
    , "JOIN":
    {
      "TYPE": "left"
      , "BGP":
      {
        "TABLE": ["producer"]
        , "TRIPLE": ["?producer", "?country"]
        , "TRIPLE": ["?producer", "?producerPublisher"]
      }
      , "BGP":
      {
        "TABLE": ["offer"]
        , "TRIPLE": ["?offer", "?vendor"]
      }
    }
    , "FILTER":
    {
      "optype": "Compare"
      , "operator": "="
      , "leftvalue": "?country"
      , "rightvalue": "DE"
    }
    , "FILTER":
    {
      "optype": "Compare"
      , "operator": "="
      , "leftvalue": "?producerPublisher"
      , "rightvalue": "1"
    }
    , "FILTER":
    {
      "optype": "Compare"
      , "operator": ">="
      , "leftvalue": "?vendor"
      , "rightvalue": "50"
    }
  }
}
Listing A.3: Tree Representation of SPARQL query A.1
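As a hedged illustration of how such a tree could be preprocessed into per-node records before vector encoding (the node kinds mirror Listing A.2; the traversal below is an assumption, not OPTIMA's exact pipeline):

object TreeFeatureSketch {
  // One node of the operator tree in Listing A.2: a type tag, its operands,
  // and its child nodes.
  final case class Op(kind: String, operands: List[String], children: List[Op])

  // Pre-order traversal flattening the tree into per-node records, a plausible
  // first step before encoding nodes as feature vectors.
  def flatten(n: Op): List[(String, List[String])] =
    (n.kind, n.operands) :: n.children.flatMap(flatten)

  def main(args: Array[String]): Unit = {
    val join = Op("JOIN", Nil, List(
      Op("BGP", Nil, List(Op("TRIPLE", List("?producer", "?country"), Nil))),
      Op("BGP", Nil, List(Op("TRIPLE", List("?offer", "?vendor"), Nil)))))
    val filter = Op("FILTER", List("=", "?country", "DE"), Nil)
    val root   = Op("DISTINCT", Nil, List(join, filter))
    flatten(root).foreach(println) // (DISTINCT,List()), (JOIN,List()), ...
  }
}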
Acknowledgements
The authors acknowledge the financial support of the Fraunhofer Cluster of Excellence (CCIT) and thank Dr. Mohamed Najib Mami for the valuable comments that helped to implement our work.
References
1. D. Alter, "2025: The digitization of the world - from edge to core," Framingham, MA, 2020.
2. C. Snijders, U. Matzat, and U.-D. Reips, "'Big data': Big gaps of knowledge in the field of internet science," International Journal of Internet Science, vol. 7, no. 1, 2012, pp. 1-5.
3. A. Gandomi and M. Haider, "Beyond the hype: Big data concepts, methods, and analytics," International Journal of Information Management, vol. 35, no. 2, 2015, pp. 137-144.
4. A. Cuzzocrea, L. Bellatreche, and I. Song, "Data warehousing and OLAP over big data: Current challenges and future research directions," in Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, DOLAP 2013, San Francisco, CA, USA, October 28, 2013. ACM, 2013, pp. 67-70.
5. M. N. Mami, "Strategies for a semantified uniform access to large and heterogeneous data sources," Ph.D. dissertation, University of Bonn, Germany, 2021.
6. M. Rouse, "What is data virtualization," 2011.
7. N. Miloslavskaya and A. Tolstoy, "Big data, fast data and data lake concepts," Procedia Computer Science, vol. 88, 2016, pp. 300-305.
8. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati, "Linking data to ontologies," Journal on Data Semantics X. Springer, 2008.
9. H. Dehainsala, G. Pierra, and L. Bellatreche, "OntoDB: An ontology-based database for data intensive applications," in Advances in Databases: Concepts, Systems and Applications, 12th International Conference on Database Systems for Advanced Applications, DASFAA, Thailand. Springer, 2007, pp. 497-508.
10. M. Giese, A. Soylu, G. Vega-Gorgojo, A. Waaler, P. Haase, E. Jiménez-Ruiz, D. Lanti, M. Rezk, G. Xiao, Ö. Özçep et al., "Optique: Zooming in on big data," Computer, vol. 48, no. 3, 2015.
11. K. M. Endris, P. D. Rohde, M.-E. Vidal, and S. Auer, "Ontario: Federated query processing against a semantic data lake," in International Conference on Database and Expert Systems Applications. Springer, 2019, pp. 379-395.
12. M. N. Mami, D. Graux, S. Scerri, H. Jabeen, S. Auer, and J. Lehmann, "Squerall: Virtual ontology-based access to heterogeneous and large data sources," in Proceedings of the 18th International Semantic Web Conference, 2019.
13. S. T. Al-Amin, C. Ordonez, and L. Bellatreche, "Big data analytics: Exploring graphs with optimized SQL queries," in Database and Expert Systems Applications - DEXA 2018 International Workshops, BDMICS, BIOKDD, and TIR, Regensburg, Germany, September 3-6, 2018, Proceedings. Springer, 2018, pp. 88-100.
14. W. E. Zhang, Q. Z. Sheng, Y. Qin, K. Taylor, and L. Yao, "Learning-based SPARQL query performance modeling and prediction," World Wide Web, vol. 21, no. 4, 2018, pp. 1015-1035.
15. R. Hasan and F. Gandon, "A machine learning approach to SPARQL query performance prediction," in International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2014, pp. 266-273.
16. R. Singh, "Inductive learning-based SPARQL query optimization," Data Science and Intelligent Applications, 2021, pp. 121-135.
17. I. Zouaghi, A. Mesmoudi, J. Galicia, L. Bellatreche, and T. Aguili, "Query optimization for large scale clustered RDF data," in Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with EDBT/ICDT 2020 Joint Conference, DOLAP@EDBT/ICDT 2020, Denmark, 2020, pp. 56-65.
18. E. Gallinucci, M. Golfarelli, and S. Rizzi, "Schema profiling of document-oriented databases," Information Systems, vol. 75, 2018.
19. A. Senk, M. Valenta, and W. Benn, "Distributed evaluation of XPath axes queries over large XML documents stored in MapReduce clusters," in 25th International Workshop on Database and Expert Systems Applications, Germany, 2014, pp. 253-257.
20. J. F. Sequeda, M. Arenas, and D. P. Miranker, "On directly mapping relational databases to RDF and OWL," in Proceedings of the 21st International Conference on World Wide Web, 2012, pp. 649-658.
21. M. A. Rodriguez and P. Neubauer, "The graph traversal pattern," in S. Sakr and E. Pardede (eds.), Graph Data Management: Techniques and Applications. IGI Global, 2011, pp. 29-46.
22. M. Y. Santos and C. Costa, "Data warehousing in big data: From multidimensional to tabular data models," in Proceedings of the Ninth International C* Conference on Computer Science & Software Engineering, Portugal, 2016. ACM, 2016, pp. 51-60.
23. M. Vidal, E. Ruckhaus, T. Lampo, A. Martínez, J. Sierra, and A. Polleres, "Efficiently joining group patterns in SPARQL queries," 2010, pp. 228-242.
24. M. N. Mami, D. Graux, S. Scerri, H. Jabeen, S. Auer, and J. Lehmann, "Uniform access to multiform data lakes using semantic technologies," in Proceedings of the 21st International Conference iiWAS2019. ACM, 2019, pp. 313-322.
25. J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica, "GraphX: Graph processing in a distributed dataflow framework," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 599-613.
26. S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, "Big data analytics on Apache Spark," International Journal of Data Science and Analytics, vol. 1, no. 3, 2016, pp. 145-164.
27. D. Crankshaw, A. Dave, R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "The GraphX graph processing system," UC Berkeley AMPLab.
28. M.-E. Vidal, E. Ruckhaus, T. Lampo, A. Martínez, J. Sierra, and A. Polleres, "Efficiently joining group patterns in SPARQL queries," in Extended Semantic Web Conference. Springer, 2010, pp. 228-242.
29. M. Henderson and R. Lawrence, "Are multi-way joins actually useful?" in ICEIS (1), 2013, pp. 13-22.
30. I. Elghandour, A. Kara, D. Olteanu, and S. Vansummeren, "Incremental techniques for large-scale dynamic query processing," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 2297-2298.
31. K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630-645.
32. Y. Yang, I. G. Morillo, and T. M. Hospedales, "Deep neural decision trees," CoRR, vol. abs/1806.06988, 2018.
33. J. Sun and G. Li, "An end-to-end learning-based cost estimator," Proc. VLDB Endow., vol. 13, no. 3, 2019, pp. 307-319.
34. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.
35. C. Bizer and A. Schultz, "The Berlin SPARQL benchmark," International Journal on Semantic Web and Information Systems (IJSWIS), vol. 5, no. 2, 2009, pp. 1-24.
36. D. Graux, L. Jachiet, P. Genevès, and N. Layaïda, "A multi-criteria experimental ranking of distributed SPARQL evaluators," in 2018 IEEE International Conference on Big Data. IEEE, 2018, pp. 693-702.
37. R. F. van der Lans, "Architecting the multi-purpose data lake with data virtualization," Denodo whitepapers, 2018.
38. D. Chatziantoniou and V. Kantere, "Just-in-time modeling with DataMingler," in Proceedings of the ER Demos and Posters 2021 co-located with the 40th International Conference on Conceptual Modeling (ER 2021), Canada, vol. 2958. CEUR-WS.org, 2021, pp. 43-48.
39. D. Calvanese, G. D. Giacomo, D. Lembo, M. Lenzerini, A. Poggi, M. Rodriguez-Muro, and R. Rosati, "Ontologies and databases: The DL-Lite approach," in Reasoning Web International Summer School. Springer, 2009, pp. 255-356.
40. D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-Muro, and G. Xiao, "Ontop: Answering SPARQL queries over relational databases," Semantic Web, vol. 8, no. 3, 2017, pp. 471-487.
Chahrazed BACHIR-BELMEHDI received the BSc degree in Information Systems in 2017 and the MSc degree in Information Systems Engineering in 2019 from Djillali Liabes University, Algeria. She is working toward the doctoral degree in the LabRI laboratory at ESI-SBA, Algeria. Her research interests include Big Data Analysis, Machine Learning, and Software Engineering.
Abderrahmane Khiat holds a PhD in Knowledge Engineering from the University of Oran1, Algeria (2017) and currently works as a Senior Researcher at Fraunhofer IAIS, Germany. He ranked fifth among the top young inventors in the Middle East and Africa in "Stars of Science 2021". He received the Best Paper Award for Young Scientist Researchers in 2014. His research includes Knowledge Graphs, Big Data Integration, and Data Mining.
Nabil Keskes obtained his M.Sc. (2006) and Ph.D. (2012) and became a full professor in 2020. He is currently a full professor at the Higher School of Computer Science (ESI-SBA) in Algeria. His research interests are geared towards web service selection and the pragmatic web.