Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning

Zijie Wang, Lixi Zhou, Amitabh Das, Valay Dave, Zhanpeng Jin, and Jia Zou
Arizona State University; University at Buffalo, State University of New York
Email: {zijiewang, lixi.zhou, adas59, vddave, jia.zou}@asu.edu, zjin@buffalo.edu
Abstract—Data is king in the age of AI. However, data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to automating the end-to-end data integration process. Although mechanisms such as query discovery and schema modification languages exist to handle the problem, these approaches only work under the assumption that the schema is maintained by a database. However, we observe diversified schema changes in heterogeneous data and open data, most of which have no schema defined. In this work, we propose to use deep learning to automatically deal with schema changes, through a super cell representation and the automatic injection of perturbations into the training data to make the model robust to schema changes. Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios: coronavirus data integration and machine log integration.
Keywords—data integration, deep learning, schema changes,
schema evolution, perturbation, intermediate representation
I. INTRODUCTION
It was reported in 2018 that data scientists spend 80-90% of their effort on the data integration process [8], [69], [74]. Schema changes, which impact applications and cause system downtimes, are a major factor behind the tremendous human resource overhead required for data integration. Schema changes are often caused by software evolution, which is pervasive and persistent in agile development [36], or by the diversity of data formats due to the lack of standards [13].
Example: coronavirus disease 2019 (COVID-19) data integration. To predict the coronavirus outbreak, we integrate the coronavirus data repository at Johns Hopkins University (JHU) [4] and the Google mobility data [3]. The JHU data repository maintains the world coronavirus cases on a daily basis as a set of CSV files. However, we find that the schema of the data changes frequently. As illustrated in Fig. 1(a), the changes include attribute name changes (e.g., Longitude → Long), addition and removal of attributes (e.g., from six attributes initially to 15 attributes), attribute type changes (e.g., date formats), and key changes (e.g., from (country/region, province/state) to (combined key), (FIPS), and (country region, province state, Admin2)). The data scientist's Python code for parsing these files easily breaks at each schema change and requires manual effort to debug and fix the issues. This has become a ubiquitous pain for the users of the JHU data repository, and people even launched a project to periodically clean the JHU coronavirus
data into a stable R-friendly format¹. However, such cleaning is purely based on manual effort and obviously not scalable.

¹ https://github.com/Lucas-Czarnecki/COVID-19-CLEANED-JHUCSSE
Prior art. Schema evolution for data managed in relational databases, NoSQL databases, and multi-model databases is a well-established research topic. The fundamental idea is to capture the semantic mappings between the old and the new schemas, so that legacy queries can be transformed and/or legacy data can be migrated to work with the new schemas. There are two general approaches to capturing the semantic mappings: (1) searching for the queries that can transform the old schema to the new schema [9], [25], [30], [51], [61], [66]; and (2) asking the database administrators (DBAs) or application developers to use a domain-specific language (DSL) to describe the schema transformation process [11], [17], [31]–[33], [41], [53], [55], [56], [64].
However, these approaches are not applicable to unmanaged data, including open data, such as publicly available CSV, JSON, HTML, or text files that can be downloaded from a URL, and transient data that is directly collected from sensor devices or machines in real-time and discarded after being integrated. That is because the history of schema changes for such data is totally lost or opaque to the users. There is an urgent need to automatically handle schema changes for unmanaged data without interruptions to applications or any human intervention. Otherwise, with the rapid increase in the volume and diversity of unmanaged data in the era of Big Data and the Internet of Things (IoT), a huge amount of time and human resources will inevitably be wasted on manually handling the system downtimes incurred by schema changes.
A deep learning approach. In this work, we argue for
a new data integration pipeline that uses deep learning to
avoid interruptions caused by the schema changes. In the
past few years, deep learning (DL) has become the most
popular direction in machine learning and artificial intel-
ligence [46], [65], and has transformed many research areas, such as image recognition, computer vision, speech recognition, and natural language processing. In recent years,
DL has been applied to database systems and applications to
facilitate parameter tuning [47], [71], [76], [81], indexing [21],
[43], partitioning [34], [86], cardinality estimation and query
optimization [39], [44], and entity matching [24], [37], [42],
[57], [73], [82]. While predictions based on deep learning
cannot guarantee correctness, in the Big Data era, errors in
data integration are usually tolerable as long as most of the
data is correct, which is another motivation of our work. To the
best of our knowledge, we are the first to apply deep learning
to learn the process of join/union/aggregation-like operations
with schema changes occurring in the data sources. However, this is not an easy task, and the specific research questions include:
(1) It is not straightforward to formulate a data integration task, which is usually represented as a combination of relational or dataflow operators such as join, union, filter, map, flatmap, and aggregate, as a prediction task. What are effective representations for the features and labels?
(2) How to design the training process to make the model
robust to schema changes?
(3) Different model architectures, for example, simple and compact sequence models such as Bi-LSTM versus complex and large transformers such as GPT-2 and BERT, may strike different trade-offs among accuracy, latency, and resource consumption. What are the implications for model architecture selection in different deployment environments?
(4) Annotating data to prepare training data is always a major bottleneck in the end-to-end lifecycle of model deployment for production. How can training data preparation be automated for the aforementioned prediction tasks?
Uninterruptible integration of fast-evolving data. In this
work, we first formulate a data integration problem as a deep
learning model that predicts the position in the target dataset
for each group of related data items in the source datasets. We
propose to group related items in the same tuple or object that
will always be processed together, and abstract each group into
a super cell concept. We further propose to use source keys and
attributes as features for describing the context of each cell,
and use the target keys and attributes as labels to describe the
target position where the super cell is mapped to. The features
and labels can be represented as sentences or sentences with
masks so that the representation can be applicable to state-of-
art language models, including sequence models like Bi-LSTM
and transformers like GPT-2 and BERT.
Then, to seamlessly handle various schema changes, inspired by adversarial attacks [28], [45], a hot topic in DL, we view most types of schema changes as obfuscations injected into testing samples at inference time, which may confuse a model trained without noise. Therefore, just like adversarial training [28], [45], [68], we address the problem by adding specially designed noise to the training samples to make the model robust to schema changes. The techniques we employ include replacing words with randomly modified words and with synonyms sampled from the Google Knowledge Graph. In addition, we propose to add an aggregation mode label to indicate how to handle super cells that are mapped to the same position, which handles the key expansion type of schema change in the example we gave earlier.
Based on the above discussions, we propose a fully au-
tomated end-to-end process for uninterruptible integration of
fast-evolving data sources, as illustrated in Fig. 1(b). Step 1,
the system will leverage our proposed Lachesis intermediate
representation (IR) [86] to automatically translate the user’s
initial data integration code into the executable code that
automatically creates training data based on our proposed
representation. Step 2, obfuscations will be automatically
injected to the training data to make the model robust to
various schema changes. Step 3, different model architectures
will be chosen to train the predictive model that will perform
the data integration task depending on the model deploying
environment. Due to the space limitation, this paper will focus
on Step 2 and 3, while we will also discuss Step 1 as well
as other techniques for reducing training data preparation
overhead such as crowdsourcing and model reuse.
The contributions of this work include:
(1) To the best of our knowledge, we are the first to systematically
investigate the application of deep learning and adversarial
training techniques to automatically handle schema changes
occurring in the data sources.
(2) We propose an effective formulation of the data integration problem as a prediction task, as well as a flexible feature representation based on our super cell concept (Sec. II). We
also discuss how to alleviate the human costs involved in
preparing training data for the representation (Sec. V).
(3) We represent the common schema changes as various types of obfuscations, which can be automatically injected into the training data to make the trained model robust to these types of schema changes. (Sec. III)
(4) We compare and evaluate the trade-offs made by different model architectures, including both a simple sequence model and complex transformer models, for two different data integration tasks involving semi-structured and non-structured data, respectively. (Sec. IV and Sec. VI)
II. PROBLEM ANALYSIS AND FORMULATION
A. Assumptions
We assume a typical data integration scenario that converts a set of source datasets into one target dataset. Each of the source datasets may have a heterogeneous format, such as CSV, JSON, or text, and may not be managed by any relational or NoSQL data store. However, the target dataset must be tabular, so that each cell in the target dataset can be uniquely identified by its tuple identifier and attribute name.
B. Representation Analysis and Comparison
Given a set of objects, where each object may represent a
row in a CSV file, a JSON object, a time series record, or a file,
there are a few candidate representations for formulating the
predictive task, including dataset-level representation, object-
level representation, attribute-level representation, cell-level
representation, and our proposed super cell based represen-
tation. The coarser-grained the representation, the fewer inferences are required and the more efficient the prediction. However, a coarser-grained representation also means that the prediction task is more complex and that it is harder to train a model with acceptable accuracy, because each training sample is larger and more complex than in finer-grained representations, and the mapping relationship to learn naturally becomes more complicated.
Fig. 1. Motivation and Overview of Our Proposed Approach. (a) A human-centered approach: every schema change in the source data (e.g., new date-keyed CSV layouts with renamed, added, or reordered attributes) breaks the user's integration code, forcing revisions rev 0, rev 1, rev 2, ..., so tremendous human effort is spent in handling schema evolution before the target data is produced. (b) A deep learning approach (our proposal): the user's initial code (rev 0) is translated by the IR and code generator into training data; perturbation injection produces augmented training data for training; the resulting model then maps data at all later times and locations into the target data, fully automated by our proposed work.

Various levels of representations for the motivating example are illustrated in Fig. 2. We find it almost impossible to
collect sufficient training data represented at the dataset-level.
The object-level representation groups all attributes and is not
expressive enough in describing the different transformations
applied to each attribute. Similarly, the attribute-level repre-
sentation assembles all values in the same attribute and is
not effective in expressing logic like filtering or aggregation.
The cell-level representation, referring to a value of a single
attribute in one tuple, is at the finest granularity, which sim-
plifies the learning process. However, it may incur too many
inferences and waste computational resources, particularly if
there exist multiple cells in one object that will always be
mapped/transformed together.
Therefore, we propose and argue for a super cell representation. A super cell is a group of cells in an object that will always be mapped to the target table together in a similar way, such as the values of the confirmed attribute and the recovery attribute shown in Fig. 2. With its flexible granularity between the object level and the cell level, a super cell balances expressiveness and performance well for the training and testing process.
Fig. 2. Various representations at different granularities (use scenario: COVID-19 outbreak prediction). The figure shows Global_Mobility_Report.csv (Country_Region, Sub_Region_1, Sub_Region_2, Retail, Grocery, ...) and the JHU daily report 03-20-2020.csv (Date, Admin2, Province_State, Country_Region, Confirmed, Recovery) being integrated into a target table (Time, Subregion, Region, Confirmed_cases, Recovery_cases, Grocery_mobility, Retail_mobility), annotated with the dataset-level, object-level, attribute-level, cell-level, and super-cell-level representations.
For each super cell, we represent its features as a sentence that concatenates the keys, as well as the value and attribute of each cell in the super cell. In particular, the keys in the context features should include both the join key (if applicable) and the tuple/object identifier of the local table. We find that for 1-1 joins, the join key usually also serves as the local tuple/object identifier; for 1-N and N-M joins, the join key is not necessarily the local key; and for non-join operations (e.g., union, filter, aggregate), no join key is required. We assume the target dataset is always tabular, and we propose a label representation that includes the tuple identifier (i.e., key of the target table) of the super cell, the target attribute name of each cell, and an aggregation mode that specifies how values mapped to the same position should be aggregated into one value, e.g., add, avg, max, min, count, replace old, discard new, etc.
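To make the representation concrete, the following is a minimal Python sketch of serializing one super cell into a feature sentence and its label; the field names and the toy values are illustrative, not the exact serialization used by our system.

# A hedged sketch: serialize a super cell's context into a feature "sentence"
# and pair it with a target-position label (keys, attributes, agg mode).
def featurize(keys, attributes, values):
    return " ".join(list(keys) + [f"{a} {v}" for a, v in zip(attributes, values)])

# Source super cell: two cells (Confirmed, Recovered) sharing the same keys.
feature = featurize(
    keys=["2020-10-06", "AZ", "US"],
    attributes=["Confirmed", "Recovered"],
    values=["3103", "2214"],
)
# Label: target tuple identifier, target attribute of each cell, aggregation mode.
label = {
    "target_keys": ["2020-10-06", "Arizona", "United States"],
    "target_attributes": ["Confirmed_cases", "Recovery_cases"],
    "agg_mode": "replace",
}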
Super cell formulation. Suppose there are $m$ source datasets, represented as $D = \{d_i\}\ (0 \leq i < m)$; each source dataset is modeled as a set of $n_i$ super cells, denoted as $d_i = \{s_{ij}\}\ (0 \leq i < m, 0 \leq j < n_i)$. We further describe each super cell $s_{ij} \in d_i$ as a triplet that consists of three vectors for the keys shared by each cell in the super cell, the attribute names of each cell, and the values of each cell, respectively, represented as $s_{ij} = (\vec{key}_{ij}, \vec{attribute}_{ij}, \vec{value}_{ij})$. We can further define a super set to describe the current state of the entire data repository: $S = \{s_{ij} \in d_i \mid \forall d_i \in D\}$.
A super cell may be mapped to zero, one, or more than one positions in the target dataset, depending on the operations involved in the data integration task. For example, as illustrated in Fig. 3, in a 1-N join operation, a super cell in the table at the left-hand side may be mapped to many positions in the target table. Thus, we can represent the target positions where the super cell is mapped as a list of triples: $f(s_{ij}) = \{(\vec{key}^T_{ij}, \vec{attribute}^T_{ij}, agg\_mode)\}$, where each triple refers to one position that is indexed by the target keys $\vec{key}^T$ shared by all cells in the super cell, as well as the target attribute name of each cell in the super cell, denoted as $\vec{attribute}^T$.
Orders:
OrderKey  CustKey  TotalPrice  OrderDate
0001      0001     10.0        08012020
0002      0002     1000.0      08012020
0003      0004     50.0        08012020
0004      0002     2000.0      08012020

Customers:
CustKey  Name  Address  Zipcode
0001     A     E        85248
0002     B     F        85249
0003     C     G        85248
0004     D     H        85249

Join result:
OrderKey  CustKey  TotalPrice  OrderDate  Name  Address  Zipcode
0001      0001     10.0        08012020   A     E        85248
0002      0002     1000.0      08012020   B     F        85249
0003      0004     50.0        08012020   D     H        85249
0004      0002     2000.0      08012020   B     F        85249

Fig. 3. Example of 1-N join
C. Problem Definition
Given a fast evolving data repository $S = \{s_{ij} \in d_i \mid \forall d_i \in D\}$, for a data integration request, the user input should specify the schema of the expected target table. The schema includes a list of $p$ attributes denoted as $A = \{a_k\}\ (0 \leq k < p)$, where $a_k$ represents the $k$-th attribute in the target table, as well as a list of $q$ attributes that serve as the minimum tuple identifier (i.e., key), denoted as $A_{key} = \{a_l\}\ (0 \leq l < q)$. We further denote the set of all possible key values in the target table as $R = \{r_l \mid r_l \in a_l\}\ (0 \leq l < q)$. Then we need to find a model $f: S \to (A \cup \{NULL\}) \times (R \cup \{NULL\})$ that predicts a set of target positions denoted as $\{(\vec{key}^T_{ij}, \vec{attribute}^T_{ij}, agg\_mode)\}$ for each super cell $s_{ij} \in S$, where for each element of the key vector, $\forall x \in [0, q),\ \vec{key}^T_{ij}[x] \in R$, and for each element of the attribute vector, $\forall y \in [0, |s_{ij}|),\ \vec{attribute}^T_{ij}[y] \in A$, where $|s_{ij}|$ denotes the number of cells in the super cell $s_{ij}$. If a super cell does not belong to the target table and should be discarded, we define $r_l = NULL$ and $a_k = NULL$ in this case.
III. TRAINING DATA AUGMENTATIONS FOR SCHEMA
CHANGES
We identify five basic types of data schema changes, which
cover all of the relational or NoSQL schema changing patterns
[11], [17], [31]–[33], [41], [53], [55], [56], [64] as well as
schema changes that we have discovered from open data. We
first discuss the impact of each type and then propose our
approaches to handle these changes by adding perturbations
to the training process.
(1) Domain pivoting. For example, originally the dataset was
stored as three CSV files, describing the daily confirmed coro-
navirus cases, daily recovery cases, and daily death cases; later
the schema changed to a new set of CSV files so that each file
described all information (confirmed, death, recovery cases)
on that specific date. We observed that such changes are prevalent in the coronavirus data [3], [4], [6] and the weather database
hosted by National Oceanic and Atmospheric Administration
(NOAA) [7]. Such changes will easily break a conventional
Python-based data integration pipeline.
(2) Key expansion. For example, the key of the dataset is refined from the state level (country/region, province/state) to the county level ((combined key), (FIPS), and (country region, province state, Admin2)), which means a tuple in the original table (describing statistics for a state) is broken down into multiple tuples, each of which describes the statistics for a county. Such changes cannot be easily handled by conventional data integration methodologies.
(3) Attribute name and ordering change. For example, in the first CSV file added to the JHU COVID-19 daily report data repository on Jan 22nd, 2020, the third column name is "Last Update". But in the CSV file added to the same repository on Sept 24th, 2020, the same column is moved to the fifth position, and the name is slightly changed to "Last_Update". Such changes may interrupt a conventional program that joins two datasets on the "Last Update" column.
(4) Value type/format change. For example, in the daily COVID-19 file created on Jan 22nd, the "Last Update" column has values in the format of "1/22/2020 17:00". However, in the file from Sept 24th, the value format has changed to "2020-09-25 04:23:21". A conventional exact join operation using "Last Update" as the join key cannot handle such a value format change, unless the programmer chooses to convert it into a similarity join, which is more complicated and much slower than an exact join and results in significantly higher development costs [79], [84].
(5) Addition or removal of non-key attributes. For example, the JHU COVID-19 global daily report has changed from six attributes initially to 14 attributes after a few months. This change may not have much impact if the affected attributes are not used by the users' data integration workloads. On the contrary, if a required column is totally removed, there is no way to handle the situation without interruption, even with a deep learning approach, so we mainly focus on the first four types of schema changes.
Perturbation based on schema changes. First, our super cell based representation is not affected by domain pivoting, attribute ordering changes, or the addition or removal of irrelevant attributes, because the context of any cell remains the same despite these schema changes. Therefore, a model trained with our proposed representations is robust to these types of schema changes.
Second, schema changes such as the renaming of an attribute and the reformatting of cell values are similar to adversarial attacks, which confuse pre-trained models by adding noise to the expected testing samples. Adding perturbations to training data is an effective way of training robust models against adversarial attacks [28]. Inspired by this analogy, we add specially designed perturbations to the training data to handle these kinds of schema changes. If we see each super cell representation as a sequence of words (i.e., a sentence), the training data is a corpus of sentences. We can then augment the training data by adding new sentences (i.e., perturbations), which are derived from existing sentences by randomly replacing a word with a synonym extracted from the Google Knowledge Graph [3], [4], [6], or with a randomly modified word formed by removing one or more letters. We then train a character-based embedding on this augmented corpus using fastText [5], which maps related words to vectors that are close to each other, so that the model can recognize the similarity of such words. We observe through experiments that character-based embedding achieves better accuracy and reliability than word-based embedding and smoothly handles out-of-vocabulary words. Our experiments also show that our locally trained embedding significantly outperforms embeddings pre-trained on Google News or Wikipedia.
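Below is a minimal sketch of this perturbation injection, assuming each training sample is a feature sentence; the synonym table is a hypothetical stand-in for synonyms sampled from the Google Knowledge Graph, and dim=150 mirrors the embedding size used by our Bi-LSTM model.

import random
import fasttext  # pip install fasttext

SYNONYMS = {"Country": ["Region", "Nation"], "Confirmed": ["Positive"]}  # illustrative

def perturb(sentence: str) -> str:
    words = sentence.split()
    i = random.randrange(len(words))
    if words[i] in SYNONYMS and random.random() < 0.5:
        words[i] = random.choice(SYNONYMS[words[i]])    # synonym replacement
    elif len(words[i]) > 3:
        j = random.randrange(len(words[i]))
        words[i] = words[i][:j] + words[i][j + 1:]      # remove one letter
    return " ".join(words)

corpus = ["2020-10-06 AZ US Confirmed 3103 Recovered 2214"]
augmented = corpus + [perturb(s) for s in corpus for _ in range(3)]
with open("corpus.txt", "w") as f:
    f.write("\n".join(augmented))

# Character-based (subword n-gram) embedding trained on the augmented corpus.
embedding = fasttext.train_unsupervised("corpus.txt", model="skipgram",
                                        minn=2, maxn=5, dim=150)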
Third, to make the deep learning model robust to key expan-
sion, as mentioned, we add a new label to the representation
called “aggregation mode”. Each value of the label represents
an aggregation operator such as sum, avg, min, max, which
will be applied to the cells that are mapped to the target
position; replace, which means the cell will replace the
old cell that exists in the same position of the target table; or
discard, which means the new cell will be discarded if an
older cell has been mapped to the same target position.
IV. MODEL TRAINING AND INFERENCE
A. Model Architectures
In recent years, Natural Language Processing (NLP) has experienced several major advancements, including the bi-directional mechanism, the attention mechanism, the transformer architecture, and so on. Existing works show that the final hidden state in Bi-LSTM networks cannot capture all important information in a long sentence. Therefore, the attention mechanism was introduced to address the problem by preserving information from all hidden states of the encoder cells and aligning them with the current target output. Later, this idea was integrated into the transformer architecture, so that encoders have self-attention layers, while decoders have encoder-decoder attention layers. Most recently, to make the transformer architecture more flexible for applications other than language translation, GPT-2, which only uses the decoder part, and BERT, which only uses the encoder part, were invented and have achieved great success in a broad class of NLP problems. Our assumption is that, on one hand, more complicated models like GPT-2 and BERT may naturally achieve better accuracy than a simpler model like Bi-LSTM; on the other hand, these complex models may require significantly more storage and computational resources, as well as more training data. It is important to know the trade-offs among accuracy, latency, and resource consumption made by different model architectures. We mainly consider two types of language model architectures: (1) simple and compact sequence models based on a customized local character-based embedding and Bi-LSTM; and (2) complex and large pre-trained transformer models, such as GPT-2 [62] and BERT [20].
1) Sequence Model (Bi-LSTM): Our Bi-LSTM model architecture includes an embedding layer with 150 neurons, a Bi-LSTM layer with 512 neurons, and a fully-connected layer with 256 neurons.
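The following PyTorch sketch illustrates this architecture; the vocabulary size, the number of target-position labels, and the classification head wiring are assumptions for illustration, not our exact implementation.

import torch
import torch.nn as nn

class SuperCellBiLSTM(nn.Module):
    def __init__(self, vocab_size: int, num_labels: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 150)         # 150-d embedding
        self.bilstm = nn.LSTM(150, 512, batch_first=True,
                              bidirectional=True)              # 512-unit Bi-LSTM
        self.fc = nn.Linear(2 * 512, 256)                      # 256-unit FC layer
        self.out = nn.Linear(256, num_labels)

    def forward(self, token_ids):                              # (batch, seq_len)
        x = self.embedding(token_ids)
        _, (h, _) = self.bilstm(x)                             # h: (2, batch, 512)
        h = torch.cat([h[0], h[1]], dim=1)                     # join both directions
        return self.out(torch.relu(self.fc(h)))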
2) Transformer Model: We also consider transformer models based on GPT-2 [62] and BERT [20]. We use a pre-trained GPT-2 small model or a pre-trained BERT base model as the backend, which connects to a frontend classifier composed of four convolutional layers and a fully connected layer. During the training process, the parameters of the GPT-2 small model and the BERT base model are frozen, and only the parameters of the frontend are updated.
The pre-trained GPT-2 small model has 117 million parameters, including 12 layers of transformers, each with 12 independent attention mechanisms, called "heads", and an embedding size of 768 dimensions. The hidden vector output from the GPT-2 small model is reshaped to add a channel dimension and then passed to four convolutional layers: two max-pooled 2D convolution layers and two average-pooled 2D convolution layers, respectively. The outputs of the two branches are combined with a Hadamard product and then sent to a fully connected layer.
The BERT base model has 110 million parameters, with 12 transformer blocks, each of which has 768 hidden neurons and 12 self-attention heads. It uses the same frontend classifier architecture as the GPT-2 small model. Although GPT-2 and BERT are both based on the transformer model, they use different units of the transformer: GPT-2 is built from transformer decoder blocks constructed with masked self-attention layers, while BERT utilizes transformer encoder blocks with self-attention layers.
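A PyTorch sketch of the frontend classifier described above follows; the kernel sizes, channel counts, and sequence length are assumptions, not the exact values of our implementation.

import torch
import torch.nn as nn

class ConvFrontend(nn.Module):
    # Classifier over frozen GPT-2/BERT hidden states of shape (batch, seq_len, 768).
    def __init__(self, num_labels: int, hidden: int = 768, seq_len: int = 64):
        super().__init__()
        def branch(pool):  # two convolution layers, each followed by the given pooling
            return nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), pool(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), pool(2))
        self.max_branch = branch(nn.MaxPool2d)   # max-pooled branch
        self.avg_branch = branch(nn.AvgPool2d)   # average-pooled branch
        self.fc = nn.Linear(32 * (seq_len // 4) * (hidden // 4), num_labels)

    def forward(self, hidden_states):
        x = hidden_states.unsqueeze(1)                 # add a channel dimension
        z = self.max_branch(x) * self.avg_branch(x)    # Hadamard product
        return self.fc(z.flatten(1))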
Although transformer models usually achieve better accuracy than sequence models through their attention mechanism, they also require significantly more storage space. For example, GPT-2 small, the smallest variant of the GPT-2 model, requires more than 500 megabytes of storage space, and the BERT base model² that we use takes 450 megabytes of storage space. In contrast, the Bi-LSTM model is smaller than 1 megabyte.
B. Assembling of Inference Results
For each super cell, the model will predict a set of target positions in the form of $\{(\vec{key}^T_{ij}, \vec{attribute}^T_{ij}, agg\_mode)\}$, as we mentioned in Sec. II. Then, based on each super cell
and its predicted positions, a general data assembler puts each value into the right place in the target table. Based on the configuration, the assembler can either work in local mode, buffering and writing one file that stores the target dataset locally, or dispatch the assembled tuples to the users' registered deep learning workers (i.e., when the target data is consumed by a deep learning application) once an in-memory buffer is full. In the latter case, on each deep learning worker's side, a client is responsible for receiving and assembling tuples into the target dataset. During the dispatching process, the output table is partitioned in a way that guarantees load balance and ensures independent identical distribution (i.e., IID) to avoid introducing bias.
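A minimal sketch of the local-mode assembler is shown below, assuming each prediction pairs a (target keys, target attributes, aggregation mode) triple with the super cell's values; the aggregation-mode names follow Sec. III, and the numeric handling is illustrative.

import pandas as pd

def assemble(predictions, key_names):
    table = {}  # target key tuple -> {attribute: value}
    for (keys, attrs, agg), values in predictions:
        row = table.setdefault(tuple(keys), {})
        for attr, val in zip(attrs, values):
            if attr not in row:
                row[attr] = val
            elif agg == "add":
                row[attr] = float(row[attr]) + float(val)
            elif agg == "replace":
                row[attr] = val
            # "discard" keeps the older value
    df = pd.DataFrame.from_dict(table, orient="index")
    df.index = pd.MultiIndex.from_tuples(df.index, names=key_names)
    return df.reset_index()

preds = [  # two county-level super cells aggregated into one state-level cell
    ((["03-20-2020", "South Carolina", "US"], ["Confirmed_cases"], "add"), ["63"]),
    ((["03-20-2020", "South Carolina", "US"], ["Confirmed_cases"], "add"), ["13"]),
]
print(assemble(preds, ["Time", "Subregion", "Region"]))  # Confirmed_cases = 76.0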
V. AUTOMATION OF TRAINING DATA PREPARATION
A. Code Generation Based on Intermediate Representation
An important objective of this work is to free human experts from the dirty work of wrangling with schema changes. Therefore, it is critical to reduce the human effort required for training data preparation, such as parsing and annotating data. We propose to automate training data creation by utilizing the conventional Python code developed for integrating
² https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
an initial set of data sources. The users' Python code specifies
how to transform the data sources (usually with heterogeneous
formats) to a target table (usually in tabular format), which is
exactly the information needed for creating the training data.
This gives us an opportunity to translate users’ data integration
code to training data preparation code.
For relational data, the integration logic can be fully ex-
pressed in SQL, which maps to relational algebra. Then it
is easy to generate code for the training data creation process based on the relational algebra. First, all key and join key information is well maintained and can be directly retrieved.
Second, it is easy to identify which attributes of a table
will always be processed similarly by analyzing the relational
algebra expression, so that the values of these attributes in the
same tuple can be grouped into a super cell. For example, by
analyzing a query coded up for a data integration task such as
SELECT COVID-19.Date, COVID-19.State,
COVID-19.Country, COVID-19.Confirmed,
COVID-19.Recovered, Mobility.Workplace,
Mobility.Recreation, Mobility.Grocery
FROM COVID-19, Mobility
WHERE COVID-19.Date = Mobility.Time
AND COVID-19.Country = Mobility.Region
AND COVID-19.State = Mobility.SubRegion;
we will know that COVID-19.Confirmed and COVID-19.Recovered should be grouped into one super cell, and Mobility.Workplace, Mobility.Recreation, and Mobility.Grocery should be grouped into another super cell. Third, it is possible to rewrite each relational operator so that the input and output of the relational operators are all based on the super cell representation. For example, the input of the Join operator in the above example has the form ((2020-10-06, AZ, US), (COVID-19.Confirmed, COVID-19.Recovered), (3103, 2214)). The join outputs are then a set of pairs of super cells that match the join predicate, such as ⟨((2020-10-06, AZ, US), (COVID-19.Confirmed, COVID-19.Recovered), (3103, 2214)), ((2020-10-06, AZ, US), (Mobility.Workplace, Mobility.Recreation, Mobility.Grocery), (21, 5, 17))⟩. This output can be easily transformed into a base set of training data, into which the perturbations will be injected.
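As a sketch of how the generated code could derive training pairs from the query above, suppose both sources are already loaded as Pandas dataframes with the column names used in the query; the grouping table and the emitted pair format below are illustrative.

import pandas as pd

covid = pd.DataFrame([{"Date": "2020-10-06", "State": "AZ", "Country": "US",
                       "Confirmed": 3103, "Recovered": 2214}])
mobility = pd.DataFrame([{"Time": "2020-10-06", "Region": "US", "SubRegion": "AZ",
                          "Workplace": 21, "Recreation": 5, "Grocery": 17}])

# Attributes of the same table that are always projected together form one super cell.
groups = {"COVID-19": ["Confirmed", "Recovered"],
          "Mobility": ["Workplace", "Recreation", "Grocery"]}

joined = covid.merge(mobility, left_on=["Date", "Country", "State"],
                     right_on=["Time", "Region", "SubRegion"])

training = []
for _, row in joined.iterrows():
    keys = [row["Date"], row["State"], row["Country"]]
    for table, attrs in groups.items():
        feature = (keys, attrs, [row[a] for a in attrs])   # source super cell
        label = (keys, attrs, "replace")                   # target position + agg mode
        training.append((feature, label))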
However, because the integration code for open data is usually written in an object-oriented language such as Python, Java, or C++, the code after compilation is opaque to the system, and it is hard to modify the code directly. One solution is to map the integration code to an intermediate representation (IR), such as the Weld IR [60], which is integrated with libraries like numpy and SparkSQL, or our proposed Lachesis IR [86]. Such an IR is usually a directed acyclic graph (DAG) and can be reasoned about by the system. In this DAG, each node is an atomic computation, and each edge represents a data flow or a control flow from the source node to the destination node. The atomic computations useful to data integration workloads can usually be composed from three categories of operators:
(1) Lambda abstraction functions, such as a function that returns a literal (a constant numerical value or string), a member attribute, or a member function from an object; and unary functions such as exp, log, sqrt, sin, cos, tan, etc.
(2) Higher-order lambda composition functions, such as binary operators: &&, ||, &, |, <, >, ==, +, -, *, /; the conditional operator condition? on_true:on_false; etc.
(3) Set-based operators, such as scan and write, which read/write a set of objects from/to the storage; map, join, aggregate, flatten, filter, etc.
We propose to modify existing intermediate representations so that a super cell based processor can be derived from each atomic computation. We assume that each source dataset can be represented as a Pandas dataframe. Then, by traversing the IR graph, the system can understand the keys and the super cell mapping relationship. The super cell based processor of each atomic computation transforms each super cell representation accordingly. Take, for example, a map operator that transforms a date cell "2020-10-06" to "Oct 6, 2020": the processor takes a super cell {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["2020-10-06"]} as input, and outputs {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["Oct 6, 2020"]}, so that the contextual relationship between "Oct 6, 2020" and its source key and attribute name is preserved. The write processor transforms each super cell into a ⟨feature, label⟩ representation, such as {"source super cell": {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["Oct 6, 2020"]}, "target position": {"keys": ["Oct 6, 2020", "Arizona", "United States"], "attributes": ["datetime"]}}. In this way, we can obtain training data automatically.
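Below is a sketch of such a super-cell-based processor derived from a map computation, following the date-reformatting example above; the super cell is a plain dict, and only the cell values change while the source context is preserved.

from datetime import datetime

def map_processor(super_cell, fn, target_attr):
    out = dict(super_cell)
    out["cells"] = [fn(v) if a == target_attr else v
                    for a, v in zip(super_cell["attributes"], super_cell["cells"])]
    return out

def reformat_date(v):
    dt = datetime.strptime(v, "%Y-%m-%d")
    return f"{dt:%b} {dt.day}, {dt.year}"   # "2020-10-06" -> "Oct 6, 2020"

cell = {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"],
        "cells": ["2020-10-06"]}
print(map_processor(cell, reformat_date, "Date"))
# {'keys': ['2020-10-06', 'AZ', 'US'], 'attributes': ['Date'], 'cells': ['Oct 6, 2020']}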
However, the limitation of the above approach is that it may not work if the input object is totally nested and opaque and cannot be represented as a set of cells like Pandas dataframes or Spark dataframes. For example, a corpus of totally unstructured text files unavoidably requires human pre-processing effort. Therefore, we design the following approaches to further alleviate the problem: model reuse and crowdsourcing.
B. Crowdsourcing
According to Sec. V-A, if we are able to convert an unstructured dataset, e.g., a set of opaque and nested objects or a set of unstructured texts, into a Pandas dataframe or a similar structure, the code generation approach may be applicable to automate the training data creation process. However, it is non-trivial to identify the parsing logic, perform such a conversion, and identify the keys; all of these tasks are hard to automate. We consider crowdsourcing as a potential approach to relieve data scientists or domain experts of the burden of these tasks. However, in an experiment where we crowdsourced 160 key identification tasks to 8 graduate students and 164 undergraduate students from an introductory database course, requesting that they identify all keys, we find that the accuracy is merely 65.7%. First, some of the datasets, particularly scientific datasets, require domain-specific knowledge to tell the tuple identifier, because their attribute names are acronyms or terms that are not understandable to people outside the domain, and datasets are usually not shipped with detailed explanations of each attribute. Second, for large datasets, it is impossible for a person who is not familiar with the datasets to tell the keys. Third, it is not easy to find many people who have database knowledge. Other tasks, such as identifying super cells and parsing unstructured datasets, are even more challenging for crowdsourcing platforms due to the expert knowledge they require by nature.
C. Model Reuse
Another approach to reducing the human effort involved in preparing training data is to reuse models from similar data integration tasks. For this purpose, we design and develop a system, called ModelHub, which searches for reusable models for a new data integration task by comparing the attributes of the target dataset (created by the programmer's initial data integration code) with the target dataset of each existing data integration model. We leverage locality sensitive hashing (LSH) based on MinWise hashing [19], [85] for text-based data and LSH based on JS-divergence [15], [48] for numerical data to accelerate the attribute-matching process. Another benefit of utilizing LSH is that, in the ModelHub platform, each model only needs to be uploaded with the LSH signatures of the target dataset's attributes, while the target dataset itself does not need to be submitted, which saves storage overhead and also addresses privacy concerns.
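A sketch of this LSH-based attribute matching follows, using the datasketch library for MinWise-hash signatures of text attributes; the model names and column values are toy data, and in ModelHub only the signatures would ever leave the model owner's site.

from datasketch import MinHash, MinHashLSH

def signature(values, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(str(v).encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)
# Signatures registered for previously uploaded data integration models.
lsh.insert("covid_model.Region", signature(["US", "Italy", "Spain", "France"]))
lsh.insert("covid_model.Subregion", signature(["AZ", "SC", "NY"]))

# New task: find reusable models whose target attributes match ours.
query = signature(["US", "Italy", "Germany", "Spain"])
print(lsh.query(query))  # e.g., ['covid_model.Region']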
VI. EVALUATION
We mainly answer the following questions in this section:
(1) How effective is our proposed deep learning representation
for different data integration tasks?
(2) How effective are the perturbations added to the training
data for handling various types of schema changes?
(3) How will different super cell granularities affect the
accuracy, and the overheads for the training, testing, and
assembling process?
(4) How will different model architectures (complex and large
models vs. simple and compact models) affect the accuracy
and latency for different types of data integration tasks?
(5) How will our approach to handling schema changes improve productivity and reduce programmers' effort?
A. Environment Setup
Based on the proposed training data representation and training data perturbation methodology, we have created training data to train the Bi-LSTM model, the GPT-2 small model, and the BERT base model for two scenarios: coronavirus data integration and heterogeneous machine data integration. The first scenario mainly involves tabular source datasets in CSV format with the aforementioned schema changes. The source datasets for the second scenario, however, are mainly unstructured text data, in which most similar terms are expressed very differently on different platforms (e.g., CPU user time is logged as "CPU usage: 14.90% user" in MacOS, "%Cpu(s): 14.9 us" in Ubuntu, and "400%cpu 86%user" in Android).
Model Architectures. We compare three neural networks:
Bi-LSTM, GPT-2 small with a CNN frontend classifier, and
BERT base with the same CNN frontend classifier. The model
architectures are described in Sec. IV.
Model Training. For the training process of both scenarios, the Bi-LSTM model is relatively slower to converge, requiring around 50 epochs, while the models leveraging the pre-trained GPT-2 small and BERT base are much faster to converge, requiring only around 5 epochs, as illustrated in Fig. 4.
Fig. 4. Loss vs. epochs in the training process: (a) Bi-LSTM model; (b) transformer models.
Metrics. We evaluate and compare the accuracy, the storage
overhead, and the end-to-end training and inference latency,
with all types of schema changes as mentioned in Sec. III
applied at the inference stage. The accuracy of the data integration model is defined as the ratio of the number of super cells that have been predicted with correct target positions and aggregation actions to the total number of super cells in the testing data.
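Equivalently, as a minimal sketch over hypothetical (prediction, ground truth) pairs:

def integration_accuracy(pairs):
    # pairs: iterable of (predicted, ground_truth); a prediction is correct only
    # if the full (target keys, target attributes, agg mode) triple matches.
    pairs = list(pairs)
    return sum(1 for pred, truth in pairs if pred == truth) / len(pairs)

truth = (("2020-10-06", "AZ", "US"), ("Confirmed_cases",), "replace")
wrong = (("2020-10-06", "AZ", "US"), ("Recovery_cases",), "replace")
print(integration_accuracy([(truth, truth), (wrong, truth)]))  # 0.5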
Hardware Platform. For all experiments, unless otherwise specified, we use one NVIDIA Tesla V100 GPU from Google Colab. All running times (e.g., training time, inference time) are measured as the average of multiple repeated runs.
B. Coronavirus Data Integration Scenario
1) Experiment Setup: We evaluate our system in a COVID-
19 data integration scenario that is close to the example in
Sec. I. We predict the COVID-19 trend using daily and regional information on the numbers of various cases and mobility factors. Given a set of raw data sources, we need to create a 2-dimensional target dataset on a daily basis. In the target dataset, each row represents the coronavirus and mobility information for a state/province on a specific date, and the columns represent the state, country, number of confirmed cases, recovery cases, death cases, and the mobility factors regarding workplace, grocery, transit, etc. The target dataset can be used as input to various curve-fitting techniques [1], [2] for COVID-19 prediction.
Datasets. We assume the user specifies/recommends a small
set of initial data sources. For the first scenario, the user specifies the Johns Hopkins University COVID-19 github repository [4] and the Google mobility data [3]. The statistics
about the above source tables are illustrated in Tab. I. The
JHU dataset contains 258 files with each file representing
COVID-19 statistics on a specific date. These files have tens
of versions, growing from 36 rows and 6 attributes to 3956
rows and 14 columns.
Perturbations. We add perturbations, such as random changes to attribute names and values and the replacement of attribute names and value tokens with synonyms as described in Sec. III, to 58.3% of the attributes in the training data. In addition, we add key expansion changes, which account for 18.6% of the rows in the training data. We test the model using JHU COVID-19 data and Google mobility data collected from Feb 15, 2020 to Oct 6, 2020, as illustrated in Tab. I.
TABLE I
STATISTICS OF RELEVANT DATA SOURCES.

                 numFiles   numRows     numCols
JHU-COVID19      258        36-3956     6-14
Google-Mobility  1          2,526,500   14
2) Overall Results: The overall results are illustrated in Tab. II, which shows that by employing a complex transformer model like the pre-trained GPT-2 small or BERT base, we can achieve better accuracy, though the more complicated models require significantly more storage space and computational time for training one epoch and for inference.
The results also show that as the granularity of super cells increases, the required training and testing time is significantly reduced, while the accuracy decreases.
TABLE II
COVID-19 DATA INTEGRATION WITH DIFFERENT SUPER CELL GRANULARITY (NUMBER OF SUPER CELLS PER TUPLE OF TARGET DATASET).

#supercells  model        accuracy  T_train/epoch  T_predict
9            Bi-LSTM      96.6%     4.8 sec        18.9 sec
9            GPT-2 small  99.8%     28.1 sec       21.7 sec
9            BERT base    99.8%     29.3 sec       20.6 sec
4            Bi-LSTM      92.4%     3.2 sec        8.6 sec
4            GPT-2 small  99.6%     9.2 sec        6.8 sec
4            BERT base    99.6%     9.6 sec        6.7 sec
2            Bi-LSTM      Failed    -              -
2            GPT-2 small  99.4%     5.8 sec        3.8 sec
2            BERT base    99.4%     6.0 sec        3.8 sec
3) Ablation Study: Using the Bi-LSTM model with the single-cell representation, we also conducted a detailed ablation study, as illustrated in Tab. III. It shows that handling value format changes (e.g., date format changes like 10-06-2020 vs. 06102020, and different abbreviations of regions and subdistricts like AZ vs. Arizona) is a main factor in accuracy degradation. Using a customized synonym dictionary to encode these format changes when adding perturbations to the training data can greatly improve accuracy compared with using synonyms extracted from the Google Knowledge Graph, as illustrated in Tab. IV. In addition, we also find that character-based embedding significantly outperforms word-based embedding, as illustrated in Tab. V.
TABLE III
ABLATION STUDY FOR DIFFERENT SCHEMA CHANGE TYPES; CHANGES ARE INCREMENTALLY ADDED.

Testing Cases                                         Testing Accuracy
relevant data with no schema changes                  99.9%
irrelevant data                                       99.9%
changes of two attributes                             99.9%
changes of five attributes                            99.9%
changes of six attributes and value format changes    97.5%
key expansion                                         96.4%

TABLE IV
ACCURACY COMPARISON FOR USING SYNONYMS RETRIEVED FROM GOOGLE KNOWLEDGE GRAPH (GKG) AND USING SELF-CODED SYNONYMS.

Testing Cases                                         GKG     Self-coded
changes of two attributes                             94.5%   99.9%
changes of five attributes                            97.5%   99.9%
changes of six attributes and region format changes   82.2%   97.5%

4) Human Productivity Comparison: We developed the data integration code using Python and Pandas dataframes to integrate the JHU COVID-19 data collected on Feb 15, 2020 and the time-series Google mobility data. After Feb 15, 2020, the first schema evolution of the JHU COVID-19 data schema that breaks the integration code and causes system downtime happened on Mar 22, 2020. We invited an experienced software engineer, a Ph.D. student, and an undergraduate student to
develop the revisions respectively, and asked them to deliver the task as soon as possible. We recorded the time between the task assignment and the code submission, as well as the time they reported dedicating to fixing the issue. We find that although the reported dedicated time ranges from 15 to 25 minutes, the time between the task assignment and the code submission ranges from one to three days. This example illustrates the unpredictability of human resources. In contrast, our proposed data integration pipeline can smoothly handle schema changes without any interruptions or delays, and requires no human intervention at all.
Performance of Python-based Integration Code. We ran our Python-based, human-coded data integration pipeline on the aforementioned daily JHU COVID-19 data and Google mobility data in a C4.xlarge AWS instance that has four CPUs and eight gigabytes of memory; it takes 417 seconds on average to integrate data for one day, without considering the time required to fix the pipeline after schema changes. 97% of the time is spent on removing redundant county-level statistics from the relatively large Google mobility file, which has 2.5 million tuples; otherwise, the co-existing state-level and county-level statistics in the Google mobility file would confuse the join processing. This observation indicates that, with the acceleration of a high-end GPU processor, the overall training and inference latency of a deep learning based pipeline is lower than that of the traditional human-centered pipeline, considering that the training process only needs to be carried out at the beginning and when a concept drift [75] is detected.
C. Machine Log Integration
1) Environment Setup: Suppose a lab administrator developed a Python tool to integrate various performance metrics of a cluster of MacOS workstations, such as CPU utilization (user, system, idle, wait), memory utilization (cached, buffered, swap), network utilization (input, output), disk utilization (write, read), and so on. The tool collects these metrics by periodically reading the output of the omnipresent shell tool "top"³ and then performs a union operation over the time-series metrics collected from each machine. Now the lab has purchased four Ubuntu Linux servers. However, because the "top" tool's output in Ubuntu is very different from that in MacOS, the Python tool cannot work with these new Linux machines without additional coding effort. Such a problem is prevalent in machine and sensor data integration, where different devices produced by different manufacturers may use different schemas to describe similar information.

³ https://linux.die.net/man/1/top

TABLE V
ACCURACY COMPARISON FOR USING DIFFERENT EMBEDDING APPROACHES.

Testing Cases                                   accuracy
Word2Vec pretrained on Wikipedia                71.2%
Word2Vec pretrained on Google News              67.5%
fastText pretrained on self-customized corpus   99.9%
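As an illustration of the parsing that the learned model replaces, below is a sketch of turning one MacOS "top" line into super cells; the machine id, timestamp, and regular expression are hypothetical.

import re

def to_super_cells(machine, timestamp, line):
    cells = []
    for value, name in re.findall(r"([\d.]+)%\s+(\w+)", line):
        cells.append({"keys": [machine, timestamp],
                      "attributes": [f"cpu_{name}"],
                      "cells": [value]})
    return cells

line = "CPU usage: 14.90% user, 3.12% sys, 81.98% idle"
print(to_super_cells("mac-01", "2020-10-06T12:00", line))
# [{'keys': [...], 'attributes': ['cpu_user'], 'cells': ['14.90']}, ...]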
2) Overall Results: The results are illustrated in Tab. VI and Tab. VII, showing that our approach can achieve acceptable accuracy. In particular, the transformer models achieve significantly better accuracy than the Bi-LSTM model. For this case, with the increase in super cell granularity (i.e., a decrease in the number of super cells per target tuple), the accuracy of the Bi-LSTM network improves, while the accuracy of the transformer-based models slightly degrades. The transformer-based models achieve significantly better accuracy, but the computational time required for training (per epoch) and inference is significantly higher. Also, the larger the super cell granularity, the fewer training and testing samples there are; therefore, the time required for training and testing is also significantly reduced as the super cell granularity increases.
TABLE VI
MACHINE LOG DATA INTEGRATION WITH DIFFERENT SUPER CELL GRANULARITY (NUMBER OF SUPER CELLS PER TUPLE OF TARGET DATASET): UBUNTU AND MACOS DATA USED FOR TRAINING, AND ANDROID DATA USED FOR TESTING.

#supercells  model        accuracy  T_train/epoch  T_predict
49           Bi-LSTM      82.1%     8.7 sec        8.1 sec
49           GPT-2 small  99.6%     52.4 sec       21.3 sec
49           BERT base    99.7%     39.1 sec       15.3 sec
25           Bi-LSTM      85.3%     5.4 sec        3.6 sec
25           GPT-2 small  99.4%     29.3 sec       9.8 sec
25           BERT base    99.4%     22.7 sec       7.6 sec
3            Bi-LSTM      91.2%     2.8 sec        1.9 sec
3            GPT-2 small  99.0%     12.3 sec       4.0 sec
3            BERT base    99.1%     13.9 sec       3.7 sec
D. Output Assembling
In this section, we discuss the process of assembling prediction results into tabular files. We mainly measure how the sizes of the source datasets and target datasets and the granularity of super cells affect the overall latency of the assembling process. The results are illustrated in Fig. 5, which show that increasing the super cell granularity significantly reduces the assembling latency. This indicates that if storage space is not the bottleneck, using a transformer-based model and the largest possible super cell granularity will achieve acceptable accuracy while significantly reducing the computational time required for training, inference, and assembling.

TABLE VII
MACHINE LOG DATA INTEGRATION WITH DIFFERENT SUPER CELL GRANULARITY (NUMBER OF SUPER CELLS PER TUPLE OF TARGET DATASET): ONLY UBUNTU DATA USED FOR TRAINING, MACOS AND ANDROID DATA USED FOR TESTING.

#supercells  model        accuracy
49           Bi-LSTM      63.7%
49           GPT-2 small  99.5%
49           BERT base    99.6%
25           Bi-LSTM      71.2%
25           GPT-2 small  99.3%
25           BERT base    99.3%
3            Bi-LSTM      75.6%
3            GPT-2 small  98.7%
3            BERT base    99.1%
Fig. 5. Influence of different super cell granularities on assembling latency, with breakdown analysis (build time: parse and insert the cells involved in each prediction into a dictionary; transform time: transform the dictionary into a Pandas dataframe; write time: write the Pandas dataframe to a CSV file). Panels: (a) 100K-row latency; (b) 100K-row breakdown; (c) 500K-row latency; (d) 500K-row breakdown; (e) 1-million-row latency; (f) 1-million-row breakdown. We use different numbers of attributes (4 to 32) and rows for the target table in each experiment.
VII. RELATED WORKS
A. Handling Schema Evolutions
Schema evolution in relational database, XML, JSON and
ontology has been an active research area for a long time [23],
[63]. One major approach is through model (schema) man-
agement [10], [12] and to automatically generate executable
mapping between the old and evolved schema [51], [77], [80].
While this approach greatly expands the theoretical foundation
of relational schema evolution, it requires application mainte-
nance and may cause undesirable system downtimes [18]. To
address the problem, Prism [18] is proposed to automate the
end-to-end schema modification process by providing DBAs
a schema modification language (SMO) and automatically
rewriting users’ legacy queries. However, Prism requires data
migration to the latest schema for each schema evolution,
which may not be practical for today’s Big Data era. Other
techniques include versioning [40], [54], [67], which avoids
the data migration overhead, but incurs version management
burden and significantly slows down query performance. There
are also abundant works discussing about the schema evolu-
tion problem in NoSQL databases, Polystore or multi-model
databases [33], [35], [53], [70]
Most of these works are mainly targeting at enterprise data
integration problems and require that each source dataset is
managed by a relational or non-relational data store. However
the open data sources widely used by today’s data science
applications are often unmanaged, and thus lack schemas
or metadata information [50]. A deep learning model, once
trained, can handle most schema evolution without any human
intervention, and does not require any data migration, or
version management overhead. Moreover, today’s data science
applications are more tolerant to data errors compared to
traditional enterprise transaction applications, which makes
a deep learning approach promising.
B. Data Discovery
Data discovery aims to find related tables in a data lake. Aurum [26] is an automatic data discovery system that proposes to build an enterprise knowledge graph (EKG) to solve real-world business data integration problems. In the EKG, a node represents
a set of attributes/columns, and an edge connects two similar
nodes. In addition, a hyperedge connects any number of
nodes that are hierarchically related. They propose a two-step
approach to build EKG using LSH-based and TFIDF-based
signatures. They also provide a data discovery query language
SRQL so that users can efficiently query the relationships
among datasets. Aurum [26] mainly targets enterprise data integration. Recently, numerous works have been proposed to address open data discovery problems, including automatically discovering table unionability [59] and joinability [83], [85] based on LSH and similarity measures. Nargesian et al. [58] propose a Markov approach to optimize the navigation organization of a data lake as a DAG, so that the probability of finding a table by any of its attributes is maximized. In this navigation DAG, each node represents a subset of the attributes in the data lake, and an edge represents a navigation transition. All of these works provide helpful insights from an algorithmic perspective and a systems perspective for general data discovery problems. In particular, Fernandez et al. [27] propose a semantic matcher based on word embeddings to discover semantic links in the EKG.
Our work has the potential to integrate data discovery and schema matching into a deep learning model inference process. We argue that in our target scenario, the proposed approach can save significant storage overhead, as we only need to store the data integration models, which are significantly smaller than the EKG, and it can also achieve better performance for wide and sparse tables. We show in this paper that the training data generation and labeling process can be fully automated.
C. Schema/Entity Matching
Traditionally, to solve the data integration problem for data
science applications, once related datasets are discovered, the
programmer will either manually design queries to integrate
these datasets, or leverage a schema matching tool to automat-
ically discover queries to perform the data integration.
There are numerous prior arts in schema matching [29], [38], [51], [72], which mainly match schemas based on metadata (e.g., attribute names) and/or instances. Entity matching (EM) [16], which identifies data instances that refer to the same real-world entity, is also related. Some EM works also employ a deep learning-based approach [24], [37], [42], [49], [57], [73], [82]. Mudgal et al. [57] evaluate and compare the performance of different deep learning models applied to EM on three types of data: structured data, textual data, and dirty data (with missing values, inconsistent attributes, and/or misplaced values). They find that deep learning doesn't outperform existing EM solutions on structured data, but it outperforms them on textual and dirty data.
In addition, to apply schema matching to heterogeneous
data sources, it is important to discover schemas from semi-
structured or non-structured data. We proposed a schema
discovery mechanism for JSON data [78], among other related
works [22], [52].
Our approach proposes a super cell data model to unify open datasets. We train deep learning models to learn the mappings between the data items in source datasets and their positions, as well as aggregation modes, in the target table. If we view the context of a super cell in the source as one entity, and the target position of the super cell as another entity, then the problem we study in this work shares some similarity with the entity matching problem. The distinction is that the equivalence of two "entities" in our problem is determined by the user's data integration logic, whereas the general entity matching problem has no such constraint.
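For intuition, one training example under this view could look roughly like the following sketch; the field names here are hypothetical and simplified relative to the representation actually used in our system.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class SuperCellExample:
    # A super cell value together with its textual context in the
    # source, labeled with its position and aggregation mode in the
    # user's target table.
    value: str
    context: Tuple[str, ...]   # surrounding tokens: headers, keys, etc.
    target_row_key: str        # which row of the target table
    target_column: str         # which column of the target table
    aggregation: str           # e.g., "replace", "sum", "max"

# A confirmed-cases cell from a daily COVID-19 CSV, destined for the
# (region, date) position of the target table:
example = SuperCellExample(
    value="1024",
    context=("Confirmed", "Italy", "2020-03-01"),
    target_row_key="Italy|2020-03-01",
    target_column="confirmed_cases",
    aggregation="sum",
)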
D. Other Related Works
Thirumuruganathan et al. [74] discuss various representations for learning tasks in relational data curation. Cappuzzo et al. [14] further propose an algorithm for obtaining local embeddings using a tripartite-graph-based representation for data integration tasks, such as schema matching and entity matching, on relational databases. We mainly target open data in CSV, JSON, and text formats, and choose a super cell based representation. These works can be leveraged to improve the super cell representation and the corresponding embeddings proposed in this work.
VIII. CONCLUSION
In this work, we propose an end-to-end deep learning approach for the periodic extraction of user-expected tables from fast-evolving open data sources. We further propose a relatively stable super cell based representation to embody the fast-evolving source data, and we train models that are robust to schema changes by automatically injecting schema changes (e.g., dimension pivoting, attribute name changes, attribute addition/removal, key expansion/contraction) into the training data. We formalize the problem and conduct experiments on the integration of open COVID-19 data and machine log data. The results show that our proposed approach achieves acceptable accuracy. In addition, with our approach in place, the system is not easily interrupted by schema changes, and no human intervention is required to handle most of them.
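To illustrate what such perturbation injection can look like in practice, the Python sketch below applies three kinds of schema changes to a list-of-dicts table; the synonym dictionary, the fixed probabilities, and the column names are placeholders rather than the exact ones used in our experiments.

import random

def inject_schema_perturbations(table, seed=42):
    # Hypothetical synonym dictionary; a real injector would derive
    # candidates from observed schema-change patterns.
    rng = random.Random(seed)
    synonyms = {"Longitude": "Long_", "Latitude": "Lat",
                "Province/State": "Province_State"}
    perturbed = [dict(row) for row in table]

    # Attribute name changes: rename columns to known synonyms.
    for old, new in synonyms.items():
        if old in perturbed[0] and rng.random() < 0.5:
            for row in perturbed:
                row[new] = row.pop(old)

    # Attribute addition: insert an irrelevant new column.
    if rng.random() < 0.5:
        for row in perturbed:
            row["FIPS"] = ""

    # Attribute removal: drop one random non-key column.
    if rng.random() < 0.5:
        drop = rng.choice([c for c in perturbed[0] if c != "Country/Region"])
        for row in perturbed:
            row.pop(drop, None)
    return perturbed

rows = [{"Country/Region": "Italy", "Longitude": "12.5", "Latitude": "41.9"}]
print(inject_schema_perturbations(rows, seed=7))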
REFERENCES
[1] Caltech COVID-19 modeling. https://github.com/quantummind/caltech_covid_19_modeling.
[2] CDC COVID-19 death forecasting models. https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html.
[3] Community mobility reports. https://www.google.com/covid19/mobility/.
[4] COVID-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. https://github.com/CSSEGISandData/COVID-19.
[5] fastText. https://fasttext.cc/.
[6] Harvard COVID-19 data: county age&sex with ANN. https://dataverse.harvard.edu.
[7] National Oceanic and Atmospheric Administration. www.noaa.gov.
[8] D. Abadi, A. Ailamaki, D. Andersen, P. Bailis, M. Balazinska, P. Bern-
stein, P. Boncz, S. Chaudhuri, A. Cheung, A. Doan, et al. The seattle
report on database research. ACM SIGMOD Record, 48(4):44–53, 2020.
[9] Y. An, A. Borgida, R. J. Miller, and J. Mylopoulos. A semantic
approach to discovering schema mapping expressions. In 2007 IEEE
23rd International Conference on Data Engineering, pages 206–215.
IEEE, 2007.
[10] P. A. Bernstein. Applying model management to classical meta data
problems. In CIDR, volume 2003, pages 209–220. Citeseer, 2003.
[11] P. A. Bernstein and S. Melnik. Model management 2.0: manipulating
richer mappings. In Proceedings of the 2007 ACM SIGMOD interna-
tional conference on Management of data, pages 1–12, 2007.
[12] P. A. Bernstein and E. Rahm. Data warehouse scenarios for model
management. In International Conference on Conceptual Modeling,
pages 1–15. Springer, 2000.
[13] H. A. Campbell, F. Urbano, S. Davidson, H. Dettki, and F. Cagnacci. A
plea for standards in reporting data collected by animal-borne electronic
devices. Animal Biotelemetry, 4(1):1, 2016.
[14] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Creating embed-
dings of heterogeneous relational datasets for data integration tasks. In
Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data, pages 1335–1349, 2020.
[15] L. Chen, H. Esfandiari, G. Fu, and V. Mirrokni. Locality-sensitive
hashing for f-divergences: Mutual information loss and beyond. In
Advances in Neural Information Processing Systems, pages 10044–
10054, 2019.
[16] P. Christen. Data matching: concepts and techniques for record linkage,
entity resolution, and duplicate detection. Springer Science & Business
Media, 2012.
[17] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Automating the
database schema evolution process. The VLDB Journal, 22(1):73–98,
2013.
[18] C. A. Curino, H. J. Moon, and C. Zaniolo. Graceful database schema
evolution: the prism workbench. Proceedings of the VLDB Endowment,
1(1):761–772, 2008.
[19] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive
hashing scheme based on p-stable distributions. In Proceedings of the
twentieth annual symposium on Computational geometry, pages 253–
262, 2004.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training
of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
[21] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang,
B. Chandramouli, J. Gehrke, D. Kossmann, et al. Alex: an updatable
adaptive learned index. In Proceedings of the 2020 ACM SIGMOD
International Conference on Management of Data, pages 969–984, 2020.
[22] M. DiScala and D. J. Abadi. Automatic generation of normalized
relational schemas from nested key-value data. In Proceedings of the
2016 International Conference on Management of Data, pages 295–310,
2016.
[23] A. Doan and A. Y. Halevy. Semantic integration research in the database
community: A brief survey. AI magazine, 26(1):83–83, 2005.
[24] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang.
Deeper–deep entity resolution. arXiv preprint arXiv:1710.00597, 2017.
[25] R. Fagin, L. M. Haas, M. Hernández, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema mapping creation and data exchange. In Conceptual modeling: foundations and applications, pages 198–236. Springer, 2009.
[26] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and
M. Stonebraker. Aurum: A data discovery system. In 2018 IEEE 34th
International Conference on Data Engineering (ICDE), pages 1001–
1012. IEEE, 2018.
[27] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. Elmagarmid, I. Ilyas,
S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. Seeping
semantics: Linking datasets using word embeddings for data discovery.
In 2018 IEEE 34th International Conference on Data Engineering
(ICDE), pages 989–1000. IEEE, 2018.
[28] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Lavi-
olette, M. Marchand, and V. Lempitsky. Domain-adversarial training
of neural networks. The Journal of Machine Learning Research,
17(1):2096–2030, 2016.
[29] G. Gottlob and P. Senellart. Schema mapping discovery from data
instances. Journal of the ACM (JACM), 57(2):1–37, 2010.
[30] M. A. Hernández, R. J. Miller, and L. M. Haas. Clio: A semi-automatic tool for schema mapping. ACM SIGMOD Record, 30(2):607, 2001.
[31] K. Herrmann, H. Voigt, A. Behrend, and W. Lehner. Codel–a relationally
complete language for database evolution. In East European Conference
on Advances in Databases and Information Systems, pages 63–76.
Springer, 2015.
[32] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, and W. Lehner. Living
in parallel realities: Co-existing schema versions with a bidirectional
database evolution language. In Proceedings of the 2017 ACM Interna-
tional Conference on Management of Data, pages 1101–1116, 2017.
[33] A. Hillenbrand, M. Levchenko, U. Störl, S. Scherzinger, and M. Klettke. Migcast: Putting a price tag on data model evolution in nosql data stores. In Proceedings of the 2019 International Conference on Management of Data, pages 1925–1928, 2019.
[34] B. Hilprecht, C. Binnig, and U. Röhm. Learning a partitioning advisor for cloud databases. In D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo, editors, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pages 143–157. ACM, 2020.
[35] I. Holubová, M. Klettke, and U. Störl. Evolution management of multi-model data. In Heterogeneous Data Management, Polystores, and Analytics for Healthcare, pages 139–153. Springer, 2019.
[36] P. Howard. Data migration. Bloor Research, London, UK, 2011.
[37] J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa. Low-resource
deep entity resolution with transfer and active learning. arXiv preprint
arXiv:1906.08042, 2019.
[38] A. Kimmig, A. Memory, R. J. Miller, and L. Getoor. A collective,
probabilistic approach to schema mapping using diverse noisy evidence.
IEEE Transactions on Knowledge and Data Engineering, 31(8):1426–
1439, 2018.
[39] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. Learned
cardinalities: Estimating correlated joins with deep learning. arXiv
preprint arXiv:1809.00677, 2018.
[40] M. C. Klein and D. Fensel. Ontology versioning on the semantic web.
In SWWS, pages 75–91, 2001.
[41] M. Klettke. Evolution management of multi-model data (position paper).
[42] P. Konda, S. Das, P. Suganthan GC, A. Doan, A. Ardalan, J. R. Ballard,
H. Li, F. Panahi, H. Zhang, J. Naughton, et al. Magellan: Toward
building entity matching management systems. Proceedings of the VLDB
Endowment, 9(12):1197–1208, 2016.
[43] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case
for learned index structures. In Proceedings of the 2018 International
Conference on Management of Data, pages 489–504, 2018.
[44] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning
to optimize join queries with deep reinforcement learning. arXiv preprint
arXiv:1808.03196, 2018.
[45] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning
at scale. arXiv preprint arXiv:1611.01236, 2016.
[46] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[47] G. Li, X. Zhou, S. Li, and B. Gao. Qtune: A query-aware database
tuning system with deep reinforcement learning. Proceedings of the
VLDB Endowment, 12(12):2118–2130, 2019.
[48] X.-L. Mao, B.-S. Feng, Y.-J. Hao, L. Nie, H. Huang, and G. Wen. S2jsd-
lsh: A locality-sensitive hashing schema for probability distributions. In
Proceedings of the Thirty-First AAAI Conference on Artificial Intelli-
gence, pages 3244–3251, 2017.
[49] V. V. Meduri, L. Popa, P. Sen, and M. Sarwat. A comprehensive
benchmark framework for active learning methods in entity matching.
In Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data, pages 1133–1147, 2020.
[50] R. J. Miller. Open data integration. Proceedings of the VLDB
Endowment, 11(12):2130–2139, 2018.
[51] R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, volume 2000, pages 77–88, 2000.
[52] M. J. Mior, K. Salem, A. Aboulnaga, and R. Liu. Nose: Schema design
for nosql applications. IEEE Transactions on Knowledge and Data
Engineering, 29(10):2275–2289, 2017.
[53] M. L. Möller, M. Klettke, A. Hillenbrand, and U. Störl. Query rewriting for continuously evolving nosql databases. In International Conference on Conceptual Modeling, pages 213–221. Springer, 2019.
[54] H. J. Moon, C. A. Curino, A. Deutsch, C.-Y. Hou, and C. Zaniolo. Man-
aging and querying transaction-time databases under schema evolution.
Proceedings of the VLDB Endowment, 1(1):882–895, 2008.
[55] H. J. Moon, C. A. Curino, M. Ham, and C. Zaniolo. Prima: archiving
and querying historical data with evolving schemas. In Proceedings of
the 2009 ACM SIGMOD International Conference on Management of
data, pages 1019–1022, 2009.
[56] H. J. Moon, C. A. Curino, and C. Zaniolo. Scalable architecture and query optimization for transaction-time dbs with evolving schemas. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 207–218, 2010.
[57] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep,
E. Arcaute, and V. Raghavendra. Deep learning for entity matching:
A design space exploration. In Proceedings of the 2018 International
Conference on Management of Data, pages 19–34, 2018.
[58] F. Nargesian, K. Q. Pu, E. Zhu, B. Ghadiri Bashardoost, and R. J.
Miller. Organizing data lakes for navigation. In Proceedings of the
2020 ACM SIGMOD International Conference on Management of Data,
pages 1939–1950, 2020.
[59] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller. Table union search on
open data. Proceedings of the VLDB Endowment, 11(7):813–825, 2018.
[60] S. Palkar, J. J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe, and M. Zaharia. Weld: A common runtime for high performance data analytics. In Conference on Innovative Data Systems Research (CIDR), page 45, 2017.
[61] L. Popa, M. A. Hernandez, Y. Velegrakis, R. J. Miller, F. Naumann, and
H. Ho. Mapping xml and relational schemas with clio. In Proceedings
18th International Conference on Data Engineering, pages 498–499.
IEEE, 2002.
[62] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever.
Language models are unsupervised multitask learners. OpenAI Blog,
1(8):9, 2019.
[63] E. Rahm and P. A. Bernstein. An online bibliography on schema evolution. ACM SIGMOD Record, 35(4):30–31, 2006.
[64] S. Scherzinger, M. Klettke, and U. Störl. Managing schema evolution in nosql data stores. arXiv preprint arXiv:1308.0514, 2013.
[65] J. Schmidhuber. Deep learning in neural networks: An overview. Neural
networks, 61:85–117, 2015.
[66] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik. Dis-
covering queries based on example tuples. In Proceedings of the 2014
ACM SIGMOD international conference on Management of data, pages
493–504, 2014.
[67] Y. Sheng. Non-blocking Lazy Schema Changes in Multi-Version
Database Management Systems. PhD thesis, Carnegie Mellon University
Pittsburgh, PA, 2019.
[68] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb.
Learning from simulated and unsupervised images through adversarial
training. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2107–2116, 2017.
[69] M. Stonebraker and I. F. Ilyas. Data integration: The current status and
the way forward. IEEE Data Eng. Bull., 41(2):3–9, 2018.
[70] U. Störl, M. Klettke, and S. Scherzinger. Nosql schema evolution and data migration: State-of-the-art and opportunities. In EDBT, pages 655–658, 2020.
[71] D. G. Sullivan, M. I. Seltzer, and A. Pfeffer. Using probabilistic
reasoning to automate software tuning. ACM SIGMETRICS Performance
Evaluation Review, 32(1):404–405, 2004.
[72] B. Ten Cate, P. G. Kolaitis, and W.-C. Tan. Schema mappings and
data examples. In Proceedings of the 16th International Conference on
Extending Database Technology, pages 777–780, 2013.
[73] S. Thirumuruganathan, S. A. P. Parambath, M. Ouzzani, N. Tang, and
S. Joty. Reuse and adaptation for entity resolution through transfer
learning. arXiv preprint arXiv:1809.11084, 2018.
[74] S. Thirumuruganathan, N. Tang, M. Ouzzani, and A. Doan. Data
curation with deep learning [vision]. arXiv preprint arXiv:1803.01384,
2018.
[75] A. Tsymbal. The problem of concept drift: definitions and related work.
Computer Science Department, Trinity College Dublin, 106(2):58, 2004.
[76] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database
management system tuning through large-scale machine learning. In
Proceedings of the 2017 ACM International Conference on Management
of Data, pages 1009–1024, 2017.
[77] Y. Velegrakis, R. J. Miller, and L. Popa. Preserving mapping consistency
under schema changes. The VLDB Journal, 13(3):274–293, 2004.
[78] L. Wang, S. Zhang, J. Shi, L. Jiao, O. Hassanzadeh, J. Zou, and C. Wang. Schema management for document stores. Proceedings of the VLDB Endowment, 8(9):922–933, 2015.
[79] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for
similarity joins with edit distance constraints. Proceedings of the VLDB
Endowment, 1(1):933–944, 2008.
[80] C. Yu and L. Popa. Semantic adaptation of schema mappings when
schemas evolve. In Proceedings of the 31st international conference on
Very large data bases, pages 1006–1017. VLDB Endowment, 2005.
[81] J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang,
T. Cheng, L. Liu, et al. An end-to-end automatic cloud database tuning
system using deep reinforcement learning. In Proceedings of the 2019
International Conference on Management of Data, pages 415–432, 2019.
[82] C. Zhao and Y. He. Auto-em: End-to-end fuzzy entity-matching using
pre-trained deep models and transfer learning. In The World Wide Web
Conference, pages 2413–2424, 2019.
[83] E. Zhu, D. Deng, F. Nargesian, and R. J. Miller. Josie: Overlap set
similarity search for finding joinable tables in data lakes. In Proceedings
of the 2019 International Conference on Management of Data, pages
847–864, 2019.
[84] E. Zhu, Y. He, and S. Chaudhuri. Auto-join: Joining tables by leveraging
transformations. Proceedings of the VLDB Endowment, 10(10):1034–
1045, 2017.
[85] E. Zhu, F. Nargesian, K. Q. Pu, and R. J. Miller. Lsh ensemble: Internet-
scale domain search. arXiv preprint arXiv:1603.07410, 2016.
[86] J. Zou, P. Barhate, A. Das, A. Iyengar, B. Yuan, D. Jankov, and
C. Jermaine. Lachesis: Automated generation of persistent partitionings
for big data applications. arXiv preprint arXiv:2006.16529, 2020.