Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning

Zijie Wang, Lixi Zhou, Amitabh Das, Valay Dave, Zhanpeng Jin, and Jia Zou
Arizona State University; University at Buffalo, State University of New York
Email: {zijiewang, lixi.zhou, adas59, vddave, jia.zou}@asu.edu, zjin@buffalo.edu
Abstract—Data is king in the age of AI. However, data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to automating the end-to-end data integration process. Although mechanisms such as query discovery and schema modification languages exist to handle the problem, these approaches only work under the assumption that the schema is maintained by a database. However, we observe diversified schema changes in heterogeneous data and open data, most of which have no schema defined. In this work, we propose to use deep learning to automatically deal with schema changes, through a super cell representation and the automatic injection of perturbations into the training data to make the model robust to schema changes. Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios: coronavirus data integration and machine log integration.
Keywords—data integration, deep learning, schema changes,
schema evolution, perturbation, intermediate representation
I. INTRODUCTION
It was reported in 2018 that data scientists spend 80-90% of their effort on the data integration process [8], [69], [74]. Schema changes, which impact applications and cause system downtimes, are a major factor behind the tremendous human resource overhead required for data integration. Schema changes are often caused by software evolution, which is pervasive and persistent in agile development [36], or by the diversity of data formats due to the lack of standards [13].
Example: coronavirus disease 2019 (COVID-19) data integration. To predict the coronavirus outbreak, we integrate the coronavirus data repository at Johns Hopkins University (JHU) [4] and the Google mobility data [3]. The JHU data repository maintains the world coronavirus cases on a daily basis as a set of CSV files. However, we find that the schema of the data changes frequently. As illustrated in Fig. 1(a), the changes include attribute name changes (e.g., Longitude → Long), addition and removal of attributes (e.g., from six attributes initially to 15 attributes), attribute type changes (e.g., date formats), and key changes (e.g., from (country/region, province/state) to (combined key), (FIPS), and (country region, province state, Admin2)). The data scientist's Python code for parsing these files easily breaks at each schema change and requires manual effort to debug and fix the issues. This has become a ubiquitous pain for the users of the JHU data repository, and people even launched a project to periodically clean the JHU coronavirus
data into a stable R-friendly format¹. However, such cleaning is purely based on manual effort and obviously not scalable.

¹ https://github.com/Lucas-Czarnecki/COVID-19-CLEANED-JHUCSSE
Prior art. Schema evolution for data managed in relational databases, NoSQL databases, and multi-model databases is a well-established research topic. The fundamental idea is to capture the semantic mappings between the old and the new schemas, so that legacy queries can be transformed and/or legacy data can be migrated to work with the new schemas. There are two general approaches to capturing the semantic mappings: (1) searching for the queries that can transform the old schema to the new schema [9], [25], [30], [51], [61], [66]; and (2) asking the database administrators (DBAs) or application developers to use a domain-specific language (DSL) to describe the schema transformation process [11], [17], [31]–[33], [41], [53], [55], [56], [64].
However, these approaches are not applicable to unmanaged data, including open data, such as publicly available CSV, JSON, HTML, or text files that can be downloaded from a URL, and transient data that is directly collected from sensor devices or machines in real-time and discarded after being integrated. That is because the history of schema changes for such data is totally lost or opaque to the users. There is an urgent need to automatically handle schema changes for unmanaged data without interruptions to applications or any human intervention. Otherwise, with the rapid increase in the volume and diversity of unmanaged data in the era of Big Data and the Internet of Things (IoT), a huge amount of time and human resources will inevitably be wasted on manually handling the system downtimes incurred by schema changes.
A deep learning approach. In this work, we argue for
a new data integration pipeline that uses deep learning to
avoid interruptions caused by the schema changes. In the
past few years, deep learning (DL) has become the most
popular direction in machine learning and artificial intel-
ligence [46], [65], and has transformed many research areas, such as image recognition, computer vision, speech recognition, and natural language processing. In recent years,
DL has been applied to database systems and applications to
facilitate parameter tuning [47], [71], [76], [81], indexing [21],
[43], partitioning [34], [86], cardinality estimation and query
optimization [39], [44], and entity matching [24], [37], [42],
[57], [73], [82]. While predictions based on deep learning
cannot guarantee correctness, in the Big Data era, errors in
data integration are usually tolerable as long as most of the
data is correct, which is another motivation of our work. To the
best of our knowledge, we are the first to apply deep learning
to learn the process of join/union/aggregation-like operations
with schema changes occurring in the data sources. However, this is not an easy task, and the specific research questions include:
(1) It is not straightforward to formulate a data integration task, which is usually represented as a combination of relational or dataflow operators such as join, union, filter, map, flatmap, and aggregate, as a prediction task. What are effective representations for the features and labels?
(2) How to design the training process to make the model
robust to schema changes?
(3) Different model architectures, for example, simple and compact sequence models such as Bi-LSTM versus complex and large transformers such as GPT-2 and BERT, may strike different trade-offs among accuracy, latency, and resource consumption. What are the implications for model architecture selection in different deployment environments?
(4) Annotating data to prepare training data is always a major bottleneck in the end-to-end lifecycle of model deployment for production. How can training data preparation be automated for the aforementioned prediction tasks?
Uninterruptible integration of fast-evolving data. In this
work, we first formulate a data integration problem as a deep
learning model that predicts the position in the target dataset
for each group of related data items in the source datasets. We
propose to group related items in the same tuple or object that
will always be processed together, and abstract each group into
a super cell concept. We further propose to use source keys and
attributes as features for describing the context of each cell,
and use the target keys and attributes as labels to describe the
target position where the super cell is mapped to. The features
and labels can be represented as sentences or sentences with
masks so that the representation can be applicable to state-of-
art language models, including sequence models like Bi-LSTM
and transformers like GPT-2 and BERT.
Then, to seamlessly handle various schema changes, inspired by adversarial attacks [28], [45], a hot topic in DL, we view most types of schema changes as obfuscations injected into testing samples at inference time, which may confuse a model trained without noise. Therefore, just like adversarial training [28], [45], [68], we address the problem by adding specially designed noise to the training samples to make the model robust to schema changes. The techniques we employ include replacing words with randomly modified words and with synonyms sampled from the Google Knowledge Graph. In addition, we propose to add an aggregation mode label to indicate how to handle super cells that are mapped to the same position, which handles the key expansion type of schema change in the example we gave earlier.
Based on the above discussions, we propose a fully au-
tomated end-to-end process for uninterruptible integration of
fast-evolving data sources, as illustrated in Fig. 1(b). Step 1,
the system will leverage our proposed Lachesis intermediate
representation (IR) [86] to automatically translate the user’s
initial data integration code into the executable code that
automatically creates training data based on our proposed
representation. Step 2, obfuscations will be automatically
injected to the training data to make the model robust to
various schema changes. Step 3, different model architectures
will be chosen to train the predictive model that will perform
the data integration task depending on the model deploying
environment. Due to the space limitation, this paper will focus
on Step 2 and 3, while we will also discuss Step 1 as well
as other techniques for reducing training data preparation
overhead such as crowdsourcing and model reuse.
The contributions of this work include:
(1) To the best of our knowledge, we are the first to systematically
investigate the application of deep learning and adversarial
training techniques to automatically handle schema changes
occurring in the data sources.
(2) We propose an effective formulation of the data integration problem as a prediction task, as well as a flexible feature representation based on our super cell concept (Sec. II). We
also discuss how to alleviate the human costs involved in
preparing training data for the representation (Sec. V).
(3) We represent the common schema changes as various types of obfuscations, which can be automatically injected into the training data to make the trained model robust to these types of schema changes. (Sec. III)
(4) We compare and evaluate the trade-offs made by different model architectures, including both a simple sequence model and complex transformer models, for two different data integration tasks involving semi-structured and non-structured data, respectively. (Sec. IV and Sec. VI)
II. PROBLEM ANALYSIS AND FORMULATION
A. Assumptions
We assume a typical data integration scenario that converts a set of source datasets into one target dataset. Each of the source datasets may have a heterogeneous format, such as CSV, JSON, or text, and may not be managed by any relational or NoSQL data store. However, the target dataset must be tabular, so that each cell in the target dataset can be uniquely identified by its tuple identifier and attribute name.
B. Representation Analysis and Comparison
Given a set of objects, where each object may represent a
row in a CSV file, a JSON object, a time series record, or a file,
there are a few candidate representations for formulating the
predictive task, including dataset-level representation, object-
level representation, attribute-level representation, cell-level
representation, and our proposed super cell based represen-
tation. The coarser-grained the representation, the fewer inferences are required and the more efficient the prediction. However, a coarser-grained representation also means that the prediction task is more complex and that it is harder to train a model with acceptable accuracy, because each training sample is larger and more complex than in finer-grained representations, and the mapping relationship to learn naturally becomes more complicated.
Fig. 1. Motivation and Overview of Our Proposed Approach. (a) A human-centered approach: every schema change in the source data (e.g., new date-keyed CSV layouts with renamed, added, or reordered attributes) breaks the user's integration code, forcing revisions rev 0, rev 1, rev 2, ..., so tremendous human effort is spent in handling schema evolution before the target data is produced. (b) A deep learning approach (our proposal): the user's initial code (rev 0) is translated by the IR and code generator into training data; perturbation injection produces augmented training data for training; the resulting model then maps data at all later times and locations into the target data, fully automated by our proposed work.

Various levels of representations for the motivating example are illustrated in Fig. 2. We find it almost impossible to
collect sufficient training data represented at the dataset-level.
The object-level representation groups all attributes and is not
expressive enough in describing the different transformations
applied to each attribute. Similarly, the attribute-level repre-
sentation assembles all values in the same attribute and is
not effective in expressing logic like filtering or aggregation.
The cell-level representation, referring to a value of a single
attribute in one tuple, is at the finest granularity, which sim-
plifies the learning process. However, it may incur too many
inferences and waste computational resources, particularly if
there exist multiple cells in one object that will always be
mapped/transformed together.
Therefore, we propose and argue for a super cell representation. A super cell is a group of cells in an object that will always be mapped to the target table together in a similar way, such as the values of the confirmed attribute and the recovery attribute shown in Fig. 2. With its flexible granularity between the object level and the cell level, a super cell balances expressiveness and performance well for the training and testing process.
Fig. 2. Various representations at different granularities (use scenario: COVID-19 outbreak prediction). The figure shows Global_Mobility_Report.csv (Country_Region, Sub_Region_1, Sub_Region_2, Retail, Grocery, ...) and the JHU daily report 03-20-2020.csv (Date, Admin2, Province_State, Country_Region, Confirmed, Recovery) being integrated into a target table (Time, Subregion, Region, Confirmed_cases, Recovery_cases, Grocery_mobility, Retail_mobility), annotated with the dataset-level, object-level, attribute-level, cell-level, and super-cell-level representations.
For each super cell, we represent its features as a sentence that concatenates the keys, as well as the value and attribute of each cell in the super cell. In particular, the keys in the context features should include both the join key (if applicable) and the tuple/object identifier of the local table. We find that for 1-1 joins, the join key usually also serves as the local tuple/object identifier; for 1-N and N-M joins, the join key is not necessarily the local key; and for non-join operations (e.g., union, filter, aggregate), no join key is required. We assume the target dataset is always tabular, and we propose a label representation that includes the tuple identifier (i.e., key of the target table) of the super cell, the target attribute name of each cell, and an aggregation mode that specifies how values mapped to the same position should be aggregated into one value, e.g., add, avg, max, min, count, replace old, discard new, etc.
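To make the representation concrete, the following is a minimal Python sketch of serializing one super cell into a feature sentence and its label; the field names and the toy values are illustrative, not the exact serialization used by our system.

# A hedged sketch: serialize a super cell's context into a feature "sentence"
# and pair it with a target-position label (keys, attributes, agg mode).
def featurize(keys, attributes, values):
    return " ".join(list(keys) + [f"{a} {v}" for a, v in zip(attributes, values)])

# Source super cell: two cells (Confirmed, Recovered) sharing the same keys.
feature = featurize(
    keys=["2020-10-06", "AZ", "US"],
    attributes=["Confirmed", "Recovered"],
    values=["3103", "2214"],
)
# Label: target tuple identifier, target attribute of each cell, aggregation mode.
label = {
    "target_keys": ["2020-10-06", "Arizona", "United States"],
    "target_attributes": ["Confirmed_cases", "Recovery_cases"],
    "agg_mode": "replace",
}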
Super cell formulation. Suppose there are $m$ source datasets, represented as $D = \{d_i\}\ (0 \leq i < m)$; each source dataset is modeled as a set of $n_i$ super cells, denoted as $d_i = \{s_{ij}\}\ (0 \leq i < m, 0 \leq j < n_i)$. We further describe each super cell $s_{ij} \in d_i$ as a triplet that consists of three vectors for the keys shared by each cell in the super cell, the attribute names of each cell, and the values of each cell, respectively, represented as $s_{ij} = (\vec{key}_{ij}, \vec{attribute}_{ij}, \vec{value}_{ij})$. We can further define a super set to describe the current state of the entire data repository: $S = \{s_{ij} \in d_i \mid \forall d_i \in D\}$.
A super cell may be mapped to zero, one, or more than one positions in the target dataset, depending on the operations involved in the data integration task. For example, as illustrated in Fig. 3, in a 1-N join operation, a super cell in the table at the left-hand side may be mapped to many positions in the target table. Thus, we can represent the target positions where the super cell is mapped as a list of triples: $f(s_{ij}) = \{(\vec{key}^T_{ij}, \vec{attribute}^T_{ij}, agg\_mode)\}$, where each triple refers to one position that is indexed by the target keys $\vec{key}^T$ shared by all cells in the super cell, as well as the target attribute name of each cell in the super cell, denoted as $\vec{attribute}^T$.
Orders:
OrderKey  CustKey  TotalPrice  OrderDate
0001      0001     10.0        08012020
0002      0002     1000.0      08012020
0003      0004     50.0        08012020
0004      0002     2000.0      08012020

Customers:
CustKey  Name  Address  Zipcode
0001     A     E        85248
0002     B     F        85249
0003     C     G        85248
0004     D     H        85249

Join result:
OrderKey  CustKey  TotalPrice  OrderDate  Name  Address  Zipcode
0001      0001     10.0        08012020   A     E        85248
0002      0002     1000.0      08012020   B     F        85249
0003      0004     50.0        08012020   D     H        85249
0004      0002     2000.0      08012020   B     F        85249

Fig. 3. Example of 1-N join
C. Problem Definition
Given a fast evolving data repository $S = \{s_{ij} \in d_i \mid \forall d_i \in D\}$, for a data integration request, the user input should specify the schema of the expected target table. The schema includes a list of $p$ attributes denoted as $A = \{a_k\}\ (0 \leq k < p)$, where $a_k$ represents the $k$-th attribute in the target table, as well as a list of $q$ attributes that serve as the minimum tuple identifier (i.e., key), denoted as $A_{key} = \{a_l\}\ (0 \leq l < q)$. We further denote the set of all possible key values in the target table as $R = \{r_l \mid r_l \in a_l\}\ (0 \leq l < q)$. Then we need to find a model $f: S \to (A \cup \{NULL\}) \times (R \cup \{NULL\})$ that predicts a set of target positions denoted as $\{(\vec{key}^T_{ij}, \vec{attribute}^T_{ij}, agg\_mode)\}$ for each super cell $s_{ij} \in S$, where for each element of the key vector, $\forall x \in [0, q),\ \vec{key}^T_{ij}[x] \in R$, and for each element of the attribute vector, $\forall y \in [0, |s_{ij}|),\ \vec{attribute}^T_{ij}[y] \in A$, where $|s_{ij}|$ denotes the number of cells in the super cell $s_{ij}$. If a super cell does not belong to the target table and should be discarded, we define $r_l = NULL$ and $a_k = NULL$ in this case.
III. TRAINING DATA AUGMENTATIONS FOR SCHEMA
CHANGES
We identify five basic types of data schema changes, which
cover all of the relational or NoSQL schema changing patterns
[11], [17], [31]–[33], [41], [53], [55], [56], [64] as well as
schema changes that we have discovered from open data. We
first discuss the impact of each type and then propose our
approaches to handle these changes by adding perturbations
to the training process.
(1) Domain pivoting. For example, originally the dataset was
stored as three CSV files, describing the daily confirmed coro-
navirus cases, daily recovery cases, and daily death cases; later
the schema changed to a new set of CSV files so that each file
described all information (confirmed, death, recovery cases)
on that specific date. We observed that such changes are prevalent in the coronavirus data [3], [4], [6] and the weather database
hosted by National Oceanic and Atmospheric Administration
(NOAA) [7]. Such changes will easily break a conventional
Python-based data integration pipeline.
(2) Key expansion. For example, the key of the dataset is refined from the state level (country/region, province/state) to the county level ((combined key), (FIPS), and (country region, province state, Admin2)), which means a tuple in the original table (describing statistics for a state) is broken down into multiple tuples, each of which describes the statistics for a county. Such changes cannot be easily handled by conventional data integration methodologies.
(3) Attribute name and ordering change. For example, in the first CSV file added to the JHU COVID-19 daily report data repository on Jan 22nd, 2020, the third column name is "Last Update". But in the CSV file added to the same repository on Sept 24th, 2020, the same column is moved to the fifth position, and the name is slightly changed to "Last_Update". Such changes may interrupt a conventional program that joins two datasets on the "Last Update" column.
(4) Value type/format change. For example, in the daily COVID-19 file created on Jan 22nd, the "Last Update" column has values in the format of "1/22/2020 17:00". However, in the file from Sept 24th, the value format has changed to "2020-09-25 04:23:21". A conventional exact join operation using "Last Update" as the join key cannot handle such a value format change, unless the programmer chooses to convert it into a similarity join, which is more complicated and much slower than an exact join and results in significantly higher development costs [79], [84].
(5) Addition or removal of non-key attributes. For example, the JHU COVID-19 global daily report has changed from six attributes initially to 14 attributes after a few months. This change may not have much impact if the affected attributes are not used by the users' data integration workloads. On the contrary, if a required column is totally removed, there is no way to handle the situation without interruption, even with a deep learning approach, so we mainly focus on the first four types of schema changes.
Perturbation based on schema changes. First, our super cell based representation is not affected by domain pivoting, attribute ordering changes, or the addition or removal of irrelevant attributes, because the context of any cell remains the same despite these schema changes. Therefore, a model trained with our proposed representations is robust to these types of schema changes.
Second, schema changes such as the renaming of an attribute and the reformatting of cell values are similar to adversarial attacks, which confuse pre-trained models by adding noise to the expected testing samples. Adding perturbations to training data is an effective way of training robust models against adversarial attacks [28]. Inspired by this analogy, we add specially designed perturbations to the training data to handle these kinds of schema changes. If we see each super cell representation as a sequence of words (i.e., a sentence), the training data is a corpus of sentences. We can then augment the training data by adding new sentences (i.e., perturbations), which are derived from existing sentences by randomly replacing a word with a synonym extracted from the Google Knowledge Graph [3], [4], [6], or with a randomly modified word formed by removing one or more letters. We then train a character-based embedding on this augmented corpus using fastText [5], which maps related words to vectors that are close to each other, so that the model can recognize the similarity of such words. We observe through experiments that character-based embedding achieves better accuracy and reliability than word-based embedding and smoothly handles out-of-vocabulary words. Our experiments also show that our locally trained embedding significantly outperforms embeddings pre-trained on Google News or Wikipedia.
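Below is a minimal sketch of this perturbation injection, assuming each training sample is a feature sentence; the synonym table is a hypothetical stand-in for synonyms sampled from the Google Knowledge Graph, and dim=150 mirrors the embedding size used by our Bi-LSTM model.

import random
import fasttext  # pip install fasttext

SYNONYMS = {"Country": ["Region", "Nation"], "Confirmed": ["Positive"]}  # illustrative

def perturb(sentence: str) -> str:
    words = sentence.split()
    i = random.randrange(len(words))
    if words[i] in SYNONYMS and random.random() < 0.5:
        words[i] = random.choice(SYNONYMS[words[i]])    # synonym replacement
    elif len(words[i]) > 3:
        j = random.randrange(len(words[i]))
        words[i] = words[i][:j] + words[i][j + 1:]      # remove one letter
    return " ".join(words)

corpus = ["2020-10-06 AZ US Confirmed 3103 Recovered 2214"]
augmented = corpus + [perturb(s) for s in corpus for _ in range(3)]
with open("corpus.txt", "w") as f:
    f.write("\n".join(augmented))

# Character-based (subword n-gram) embedding trained on the augmented corpus.
embedding = fasttext.train_unsupervised("corpus.txt", model="skipgram",
                                        minn=2, maxn=5, dim=150)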
Third, to make the deep learning model robust to key expan-
sion, as mentioned, we add a new label to the representation
called “aggregation mode”. Each value of the label represents
an aggregation operator such as sum, avg, min, max, which
will be applied to the cells that are mapped to the target
position; replace, which means the cell will replace the
old cell that exists in the same position of the target table; or
discard, which means the new cell will be discarded if an
older cell has been mapped to the same target position.
IV. MODEL TRAINING AND INFERENCE
A. Model Architectures
In recent years, Natural Language Processing (NLP) has experienced several major advancements, including the bi-directional mechanism, the attention mechanism, the transformer architecture, and so on. Existing works show that the final hidden state in Bi-LSTM networks cannot capture all important information in a long sentence. Therefore, the attention mechanism was introduced to address the problem by preserving information from all hidden states of the encoder cells and aligning them with the current target output. Later, this idea was integrated into the transformer architecture, so that encoders have self-attention layers, while decoders have encoder-decoder attention layers. Most recently, to make the transformer architecture more flexible for applications other than language translation, GPT-2, which only uses the decoder part, and BERT, which only uses the encoder part, were invented and have achieved great success in a broad class of NLP problems. Our assumption is that, on one hand, more complicated models like GPT-2 and BERT may naturally achieve better accuracy than a simpler model like Bi-LSTM; on the other hand, these complex models may require significantly more storage and computational resources, as well as more training data. It is important to know the trade-offs among accuracy, latency, and resource consumption made by different model architectures. We mainly consider two types of language model architectures: (1) simple and compact sequence models based on a customized local character-based embedding and Bi-LSTM; and (2) complex and large pre-trained transformer models, such as GPT-2 [62] and BERT [20].
1) Sequence Model (Bi-LSTM): Our Bi-LSTM model architecture includes an embedding layer with 150 neurons, a Bi-LSTM layer with 512 neurons, and a fully-connected layer with 256 neurons.
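The following PyTorch sketch illustrates this architecture; the vocabulary size, the number of target-position labels, and the classification head wiring are assumptions for illustration, not our exact implementation.

import torch
import torch.nn as nn

class SuperCellBiLSTM(nn.Module):
    def __init__(self, vocab_size: int, num_labels: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 150)         # 150-d embedding
        self.bilstm = nn.LSTM(150, 512, batch_first=True,
                              bidirectional=True)              # 512-unit Bi-LSTM
        self.fc = nn.Linear(2 * 512, 256)                      # 256-unit FC layer
        self.out = nn.Linear(256, num_labels)

    def forward(self, token_ids):                              # (batch, seq_len)
        x = self.embedding(token_ids)
        _, (h, _) = self.bilstm(x)                             # h: (2, batch, 512)
        h = torch.cat([h[0], h[1]], dim=1)                     # join both directions
        return self.out(torch.relu(self.fc(h)))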
2) Transformer Model: We also consider transformer models based on GPT-2 [62] and BERT [20]. We use a pre-trained GPT-2 small model or a pre-trained BERT base model as the backend, which connects to a frontend classifier composed of four convolutional layers and a fully connected layer. During the training process, the parameters of the GPT-2 small model and the BERT base model are frozen, and only the parameters of the frontend are updated.
The pre-trained GPT-2 small model has 117 million parameters, including 12 layers of transformers, each with 12 independent attention mechanisms, called "heads", and an embedding size of 768 dimensions. The hidden vector output from the GPT-2 small model is reshaped to add a channel dimension and then passed to four convolutional layers: two max-pooled 2D convolution layers and two average-pooled 2D convolution layers, respectively. The outputs of the two branches are combined with a Hadamard product and then sent to a fully connected layer.
The BERT base model has 110 million parameters, with 12 transformer blocks, each of which has 768 hidden neurons and 12 self-attention heads. It uses the same frontend classifier architecture as the GPT-2 small model. Although GPT-2 and BERT are both based on the transformer model, they use different units of the transformer: GPT-2 is built from transformer decoder blocks constructed with masked self-attention layers, while BERT utilizes transformer encoder blocks with self-attention layers.
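A PyTorch sketch of the frontend classifier described above follows; the kernel sizes, channel counts, and sequence length are assumptions, not the exact values of our implementation.

import torch
import torch.nn as nn

class ConvFrontend(nn.Module):
    # Classifier over frozen GPT-2/BERT hidden states of shape (batch, seq_len, 768).
    def __init__(self, num_labels: int, hidden: int = 768, seq_len: int = 64):
        super().__init__()
        def branch(pool):  # two convolution layers, each followed by the given pooling
            return nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), pool(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), pool(2))
        self.max_branch = branch(nn.MaxPool2d)   # max-pooled branch
        self.avg_branch = branch(nn.AvgPool2d)   # average-pooled branch
        self.fc = nn.Linear(32 * (seq_len // 4) * (hidden // 4), num_labels)

    def forward(self, hidden_states):
        x = hidden_states.unsqueeze(1)                 # add a channel dimension
        z = self.max_branch(x) * self.avg_branch(x)    # Hadamard product
        return self.fc(z.flatten(1))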
Although transformer models usually achieve better accuracy than sequence models through their attention mechanism, they also require significantly more storage space. For example, GPT-2 small, the smallest variant of the GPT-2 model, requires more than 500 megabytes of storage space, and the BERT base model² that we use takes 450 megabytes of storage space. In contrast, the Bi-LSTM model is smaller than 1 megabyte.
B. Assembling of Inference Results
For each super cell, the model will predict a set of target positions in the form of $\{(\vec{key}^T_{ij}, \vec{attribute}^T_{ij}, agg\_mode)\}$, as we mentioned in Sec. II. Then, based on each super cell
and its predicted positions, a general data assembler puts each value into the right place in the target table. Based on the configuration, the assembler can either work in local mode, buffering and writing one file that stores the target dataset locally, or dispatch the assembled tuples to the users' registered deep learning workers (i.e., when the target data is consumed by a deep learning application) once an in-memory buffer is full. In the latter case, on each deep learning worker's side, a client is responsible for receiving and assembling tuples into the target dataset. During the dispatching process, the output table is partitioned in a way that guarantees load balance and ensures independent identical distribution (i.e., IID) to avoid introducing bias.
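A minimal sketch of the local-mode assembler is shown below, assuming each prediction pairs a (target keys, target attributes, aggregation mode) triple with the super cell's values; the aggregation-mode names follow Sec. III, and the numeric handling is illustrative.

import pandas as pd

def assemble(predictions, key_names):
    table = {}  # target key tuple -> {attribute: value}
    for (keys, attrs, agg), values in predictions:
        row = table.setdefault(tuple(keys), {})
        for attr, val in zip(attrs, values):
            if attr not in row:
                row[attr] = val
            elif agg == "add":
                row[attr] = float(row[attr]) + float(val)
            elif agg == "replace":
                row[attr] = val
            # "discard" keeps the older value
    df = pd.DataFrame.from_dict(table, orient="index")
    df.index = pd.MultiIndex.from_tuples(df.index, names=key_names)
    return df.reset_index()

preds = [  # two county-level super cells aggregated into one state-level cell
    ((["03-20-2020", "South Carolina", "US"], ["Confirmed_cases"], "add"), ["63"]),
    ((["03-20-2020", "South Carolina", "US"], ["Confirmed_cases"], "add"), ["13"]),
]
print(assemble(preds, ["Time", "Subregion", "Region"]))  # Confirmed_cases = 76.0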
V. AUTOMATION OF TRAINING DATA PREPARATION
A. Code Generation Based on Intermediate Representation
An important objective of this work is to free human experts from the dirty work of wrangling with schema changes. Therefore, it is critical to reduce the human effort required for training data preparation, such as parsing and annotating data. We propose to automate training data creation by utilizing the conventional Python code developed for integrating
² https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
an initial set of data sources. The users' Python code specifies
how to transform the data sources (usually with heterogeneous
formats) to a target table (usually in tabular format), which is
exactly the information needed for creating the training data.
This gives us an opportunity to translate users’ data integration
code to training data preparation code.
For relational data, the integration logic can be fully ex-
pressed in SQL, which maps to relational algebra. Then it
is easy to generate code for the training data creation process based on the relational algebra. First, all key and join key information is well maintained and can be directly retrieved.
Second, it is easy to identify which attributes of a table
will always be processed similarly by analyzing the relational
algebra expression, so that the values of these attributes in the
same tuple can be grouped into a super cell. For example, by
analyzing a query coded up for a data integration task such as
SELECT COVID-19.Date, COVID-19.State,
COVID-19.Country, COVID-19.Confirmed,
COVID-19.Recovered, Mobility.Workplace,
Mobility.Recreation, Mobility.Grocery
FROM COVID-19, Mobility
WHERE COVID-19.Date = Mobility.Time
AND COVID-19.Country = Mobility.Region
AND COVID-19.State = Mobility.SubRegion;
we will know that COVID-19.Confirmed and COVID-19.Recovered should be grouped into one super cell, and Mobility.Workplace, Mobility.Recreation, and Mobility.Grocery should be grouped into another super cell. Third, it is possible to rewrite each relational operator so that the input and output of the relational operators are all based on the super cell representation. For example, the input of the Join operator in the above example has the form ((2020-10-06, AZ, US), (COVID-19.Confirmed, COVID-19.Recovered), (3103, 2214)). The join outputs are then a set of pairs of super cells that match the join predicate, such as ⟨((2020-10-06, AZ, US), (COVID-19.Confirmed, COVID-19.Recovered), (3103, 2214)), ((2020-10-06, AZ, US), (Mobility.Workplace, Mobility.Recreation, Mobility.Grocery), (21, 5, 17))⟩. This output can be easily transformed into a base set of training data, into which the perturbations will be injected.
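As a sketch of how the generated code could derive training pairs from the query above, suppose both sources are already loaded as Pandas dataframes with the column names used in the query; the grouping table and the emitted pair format below are illustrative.

import pandas as pd

covid = pd.DataFrame([{"Date": "2020-10-06", "State": "AZ", "Country": "US",
                       "Confirmed": 3103, "Recovered": 2214}])
mobility = pd.DataFrame([{"Time": "2020-10-06", "Region": "US", "SubRegion": "AZ",
                          "Workplace": 21, "Recreation": 5, "Grocery": 17}])

# Attributes of the same table that are always projected together form one super cell.
groups = {"COVID-19": ["Confirmed", "Recovered"],
          "Mobility": ["Workplace", "Recreation", "Grocery"]}

joined = covid.merge(mobility, left_on=["Date", "Country", "State"],
                     right_on=["Time", "Region", "SubRegion"])

training = []
for _, row in joined.iterrows():
    keys = [row["Date"], row["State"], row["Country"]]
    for table, attrs in groups.items():
        feature = (keys, attrs, [row[a] for a in attrs])   # source super cell
        label = (keys, attrs, "replace")                   # target position + agg mode
        training.append((feature, label))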
However, because the integration code for open data is usually written in an object-oriented language such as Python, Java, or C++, the code after compilation is opaque to the system, and it is hard to modify the code directly. One solution is to map the integration code to an intermediate representation (IR), such as the Weld IR [60], which is integrated with libraries like numpy and SparkSQL, or our proposed Lachesis IR [86]. Such an IR is usually a directed acyclic graph (DAG) and can be reasoned about by the system. In this DAG, each node is an atomic computation, and each edge represents a data flow or a control flow from the source node to the destination node. The atomic computations useful to data integration workloads can usually be composed from three categories of operators:
(1) Lambda abstraction functions, such as a function that returns a literal (a constant numerical value or string), a member attribute, or a member function from an object; and unary functions such as exp, log, sqrt, sin, cos, tan, etc.
(2) Higher-order lambda composition functions, such as binary operators: &&, ||, &, |, <, >, ==, +, -, *, /; the conditional operator condition? on_true:on_false; etc.
(3) Set-based operators, such as scan and write, which read/write a set of objects from/to the storage; map, join, aggregate, flatten, filter, etc.
We propose to modify existing intermediate representations so that a super cell based processor can be derived from each atomic computation. We assume that each source dataset can be represented as a Pandas dataframe. Then, by traversing the IR graph, the system can understand the keys and the super cell mapping relationship. The super cell based processor of each atomic computation transforms each super cell representation accordingly. Take, for example, a map operator that transforms a date cell "2020-10-06" to "Oct 6, 2020": the processor takes a super cell {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["2020-10-06"]} as input, and outputs {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["Oct 6, 2020"]}, so that the contextual relationship between "Oct 6, 2020" and its source key and attribute name is preserved. The write processor transforms each super cell into a ⟨feature, label⟩ representation, such as {"source super cell": {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["Oct 6, 2020"]}, "target position": {"keys": ["Oct 6, 2020", "Arizona", "United States"], "attributes": ["datetime"]}}. In this way, we can obtain training data automatically.
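Below is a sketch of such a super-cell-based processor derived from a map computation, following the date-reformatting example above; the super cell is a plain dict, and only the cell values change while the source context is preserved.

from datetime import datetime

def map_processor(super_cell, fn, target_attr):
    out = dict(super_cell)
    out["cells"] = [fn(v) if a == target_attr else v
                    for a, v in zip(super_cell["attributes"], super_cell["cells"])]
    return out

def reformat_date(v):
    dt = datetime.strptime(v, "%Y-%m-%d")
    return f"{dt:%b} {dt.day}, {dt.year}"   # "2020-10-06" -> "Oct 6, 2020"

cell = {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"],
        "cells": ["2020-10-06"]}
print(map_processor(cell, reformat_date, "Date"))
# {'keys': ['2020-10-06', 'AZ', 'US'], 'attributes': ['Date'], 'cells': ['Oct 6, 2020']}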
However, the limitation of the above approach is that it may not work if the input object is totally nested and opaque and cannot be represented as a set of cells like Pandas dataframes or Spark dataframes. For example, a corpus of totally unstructured text files unavoidably requires human pre-processing effort. Therefore, we design the following approaches to further alleviate the problem: model reuse and crowdsourcing.
B. Crowdsourcing
According to Sec. V-A, if we are able to convert an unstructured dataset, e.g., a set of opaque and nested objects or a set of unstructured texts, into a Pandas dataframe or a similar structure, the code generation approach may be applicable to automate the training data creation process. However, it is non-trivial to identify the parsing logic, perform such a conversion, and identify the keys; all of these tasks are hard to automate. We consider crowdsourcing as a potential approach to relieve data scientists or domain experts of the burden of these tasks. However, in an experiment where we crowdsourced 160 key identification tasks to 8 graduate students and 164 undergraduate students from an introductory database course, requesting that they identify all keys, we find that the accuracy is merely 65.7%. First, some of the datasets, particularly scientific datasets, require domain-specific knowledge to tell the tuple identifier, because their attribute names are acronyms or terms that are not understandable to people outside the domain, and datasets are usually not shipped with detailed explanations of each attribute. Second, for large datasets, it is impossible for a person who is not familiar with the datasets to tell the keys. Third, it is not easy to find many people who have database knowledge. Other tasks, such as identifying super cells and parsing unstructured datasets, are even more challenging for crowdsourcing platforms due to the expert knowledge they require by nature.
C. Model Reuse
Another approach to reducing the human effort involved in preparing training data is to reuse models from similar data integration tasks. For this purpose, we design and develop a system, called ModelHub, which searches for reusable models for a new data integration task by comparing the attributes of the target dataset (created by the programmer's initial data integration code) with the target dataset of each existing data integration model. We leverage locality sensitive hashing (LSH) based on MinWise hashing [19], [85] for text-based data and LSH based on JS-divergence [15], [48] for numerical data to accelerate the attribute-matching process. Another benefit of utilizing LSH is that, in the ModelHub platform, each model only needs to be uploaded with the LSH signatures of the target dataset's attributes, while the target dataset itself does not need to be submitted, which saves storage overhead and also addresses privacy concerns.
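A sketch of this LSH-based attribute matching follows, using the datasketch library for MinWise-hash signatures of text attributes; the model names and column values are toy data, and in ModelHub only the signatures would ever leave the model owner's site.

from datasketch import MinHash, MinHashLSH

def signature(values, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(str(v).encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)
# Signatures registered for previously uploaded data integration models.
lsh.insert("covid_model.Region", signature(["US", "Italy", "Spain", "France"]))
lsh.insert("covid_model.Subregion", signature(["AZ", "SC", "NY"]))

# New task: find reusable models whose target attributes match ours.
query = signature(["US", "Italy", "Germany", "Spain"])
print(lsh.query(query))  # e.g., ['covid_model.Region']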
VI. EVALUATION
We mainly answer the following questions in this section:
(1) How effective is our proposed deep learning representation
for different data integration tasks?
(2) How effective are the perturbations added to the training
data for handling various types of schema changes?
(3) How will different super cell granularities affect the
accuracy, and the overheads for the training, testing, and
assembling process?
(4) How will different model architectures (complex and large
models vs. simple and compact models) affect the accuracy
and latency for different types of data integration tasks?
(5) How will our approach to handling schema changes improve productivity and reduce programmers' effort?
A. Environment Setup
Based on the proposed training data representation and training data perturbation methodology, we have created training data to train the Bi-LSTM model, the GPT-2 small model, and the BERT base model for two scenarios: coronavirus data integration and heterogeneous machine data integration. The first scenario mainly involves tabular source datasets in CSV format with the aforementioned schema changes. The source datasets for the second scenario, however, are mainly unstructured text data, in which most similar terms are expressed very differently on different platforms (e.g., CPU user time is logged as "CPU usage: 14.90% user" in MacOS, "%Cpu(s): 14.9 us" in Ubuntu, and "400%cpu 86%user" in Android).
Model Architectures. We compare three neural networks:
Bi-LSTM, GPT-2 small with a CNN frontend classifier, and
BERT base with the same CNN frontend classifier. The model
architectures are described in Sec. IV.
Model Training. For the training process of both scenarios, the Bi-LSTM model is relatively slower to converge, requiring around 50 epochs, while the models leveraging the pre-trained GPT-2 small and BERT base are much faster to converge, requiring only around 5 epochs, as illustrated in Fig. 4.
Fig. 4. Loss vs. epochs in the training process: (a) Bi-LSTM model; (b) transformer models.
Metrics. We evaluate and compare the accuracy, the storage
overhead, and the end-to-end training and inference latency,
with all types of schema changes as mentioned in Sec. III
applied at the inference stage. The accuracy of the data integration model is defined as the ratio of the number of super cells that have been predicted with correct target positions and aggregation actions to the total number of super cells in the testing data.
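Equivalently, as a minimal sketch over hypothetical (prediction, ground truth) pairs:

def integration_accuracy(pairs):
    # pairs: iterable of (predicted, ground_truth); a prediction is correct only
    # if the full (target keys, target attributes, agg mode) triple matches.
    pairs = list(pairs)
    return sum(1 for pred, truth in pairs if pred == truth) / len(pairs)

truth = (("2020-10-06", "AZ", "US"), ("Confirmed_cases",), "replace")
wrong = (("2020-10-06", "AZ", "US"), ("Recovery_cases",), "replace")
print(integration_accuracy([(truth, truth), (wrong, truth)]))  # 0.5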
Hardware Platform. For all experiments, unless otherwise specified, we use one NVIDIA Tesla V100 GPU from Google Colab. All running times (e.g., training time, inference time) are measured as the average of multiple repeated runs.
B. Coronavirus Data Integration Scenario
1) Experiment Setup: We evaluate our system in a COVID-
19 data integration scenario that is close to the example in
Sec. I. We predict the COVID-19 trend using daily and regional information on the numbers of various cases and mobility factors. Given a set of raw data sources, we need to create a 2-dimensional target dataset on a daily basis. In the target dataset, each row represents the coronavirus and mobility information for a state/province on a specific date, and the columns represent the state, country, number of confirmed cases, recovery cases, death cases, and the mobility factors regarding workplace, grocery, transit, etc. The target dataset can be used as input to various curve-fitting techniques [1], [2] for COVID-19 prediction.
Datasets. We assume the user specifies/recommends a small
set of initial data sources. For the first scenario, the user specifies the Johns Hopkins University COVID-19 github repository [4] and the Google mobility data [3]. The statistics
about the above source tables are illustrated in Tab. I. The
JHU dataset contains 258 files with each file representing
COVID-19 statistics on a specific date. These files have tens
of versions, growing from 36 rows and 6 attributes to 3956
rows and 14 columns.
Perturbations. We add perturbations, such as random changes to attribute names and values and the replacement of attribute names and value tokens with synonyms as described in Sec. III, to 58.3% of the attributes in the training data. In addition, we add key expansion changes, which account for 18.6% of the rows in the training data. We test the model using JHU COVID-19 data and Google mobility data collected from Feb 15, 2020 to Oct 6, 2020, as illustrated in Tab. I.
TABLE I
STATISTICS OF RELEVANT DATA SOURCES.

                 numFiles   numRows     numCols
JHU-COVID19      258        36-3956     6-14
Google-Mobility  1          2,526,500   14
2) Overall Results: The overall results are illustrated in Tab. II, which shows that by employing a complex transformer model like the pre-trained GPT-2 small or BERT base, we can achieve better accuracy, though the more complicated models require significantly more storage space and computational time for training one epoch and for inference.
The results also show that as the granularity of super cells increases, the required training and testing time is significantly reduced, while the accuracy decreases.
TABLE II
COVID-19 DATA INTEGRATION WITH DIFFERENT SUPER CELL GRANULARITY (NUMBER OF SUPER CELLS PER TUPLE OF TARGET DATASET).

#supercells  model        accuracy  T_train/epoch  T_predict
9            Bi-LSTM      96.6%     4.8 sec        18.9 sec
9            GPT-2 small  99.8%     28.1 sec       21.7 sec
9            BERT base    99.8%     29.3 sec       20.6 sec
4            Bi-LSTM      92.4%     3.2 sec        8.6 sec
4            GPT-2 small  99.6%     9.2 sec        6.8 sec
4            BERT base    99.6%     9.6 sec        6.7 sec
2            Bi-LSTM      Failed    -              -
2            GPT-2 small  99.4%     5.8 sec        3.8 sec
2            BERT base    99.4%     6.0 sec        3.8 sec
3) Ablation Study: Using the Bi-LSTM model with the single-cell representation, we also conducted a detailed ablation study, as illustrated in Tab. III. It shows that handling value format changes (e.g., date format changes like 10-06-2020 vs. 06102020, and different abbreviations of regions and subdistricts like AZ vs. Arizona) is a main factor in accuracy degradation. Using a customized synonym dictionary to encode these format changes when adding perturbations to the training data can greatly improve accuracy compared with using synonyms extracted from the Google Knowledge Graph, as illustrated in Tab. IV. In addition, we also find that character-based embedding significantly outperforms word-based embedding, as illustrated in Tab. V.
TABLE III
ABLATION STUDY FOR DIFFERENT SCHEMA CHANGE TYPES; CHANGES ARE INCREMENTALLY ADDED.

Testing Cases                                         Testing Accuracy
relevant data with no schema changes                  99.9%
irrelevant data                                       99.9%
changes of two attributes                             99.9%
changes of five attributes                            99.9%
changes of six attributes and value format changes    97.5%
key expansion                                         96.4%

TABLE IV
ACCURACY COMPARISON FOR USING SYNONYMS RETRIEVED FROM GOOGLE KNOWLEDGE GRAPH (GKG) AND USING SELF-CODED SYNONYMS.

Testing Cases                                         GKG     Self-coded
changes of two attributes                             94.5%   99.9%
changes of five attributes                            97.5%   99.9%
changes of six attributes and region format changes   82.2%   97.5%

4) Human Productivity Comparison: We developed the data integration code using Python and Pandas dataframes to integrate the JHU COVID-19 data collected on Feb 15, 2020 and the time-series Google mobility data. After Feb 15, 2020, the first schema evolution of the JHU COVID-19 data schema that breaks the integration code and causes system downtime happened on Mar 22, 2020. We invited an experienced software engineer, a Ph.D. student, and an undergraduate student to
develop the revisions respectively, and asked them to deliver the task as soon as possible. We recorded the time between the task assignment and the code submission, as well as the time they reported dedicating to fixing the issue. We find that although the reported dedicated time ranges from 15 to 25 minutes, the time between the task assignment and the code submission ranges from one to three days. This example illustrates the unpredictability of human resources. In contrast, our proposed data integration pipeline can smoothly handle schema changes without any interruptions or delays, and requires no human intervention at all.
Performance of Python-based Integration Code. We ran our Python-based, human-coded data integration pipeline on the aforementioned daily JHU COVID-19 data and Google mobility data in a C4.xlarge AWS instance that has four CPUs and eight gigabytes of memory; it takes 417 seconds on average to integrate data for one day, without considering the time required to fix the pipeline after schema changes. 97% of the time is spent on removing redundant county-level statistics from the relatively large Google mobility file, which has 2.5 million tuples; otherwise, the co-existing state-level and county-level statistics in the Google mobility file would confuse the join processing. This observation indicates that, with the acceleration of a high-end GPU processor, the overall training and inference latency of a deep learning based pipeline is lower than that of the traditional human-centered pipeline, considering that the training process only needs to be carried out at the beginning and when a concept drift [75] is detected.
C. Machine Log Integration
1) Environment Setup: Suppose a lab administrator developed a Python tool to integrate various performance metrics of a cluster of MacOS workstations, such as CPU utilization (user, system, idle, wait), memory utilization (cached, buffered, swap), network utilization (input, output), disk utilization (write, read), and so on. The tool collects these metrics by periodically reading the output of the omnipresent shell tool "top"³ and then performs a union operation over the time-series metrics collected from each machine. Now the lab has purchased four Ubuntu Linux servers. However, because the "top" tool's output in Ubuntu is very different from that in MacOS, the Python tool cannot work with these new Linux machines without additional coding effort. Such a problem is prevalent in machine and sensor data integration, where different devices produced by different manufacturers may use different schemas to describe similar information.

³ https://linux.die.net/man/1/top

TABLE V
ACCURACY COMPARISON FOR USING DIFFERENT EMBEDDING APPROACHES.

Testing Cases                                   accuracy
Word2Vec pretrained on Wikipedia                71.2%
Word2Vec pretrained on Google News              67.5%
fastText pretrained on self-customized corpus   99.9%
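As an illustration of the parsing that the learned model replaces, below is a sketch of turning one MacOS "top" line into super cells; the machine id, timestamp, and regular expression are hypothetical.

import re

def to_super_cells(machine, timestamp, line):
    cells = []
    for value, name in re.findall(r"([\d.]+)%\s+(\w+)", line):
        cells.append({"keys": [machine, timestamp],
                      "attributes": [f"cpu_{name}"],
                      "cells": [value]})
    return cells

line = "CPU usage: 14.90% user, 3.12% sys, 81.98% idle"
print(to_super_cells("mac-01", "2020-10-06T12:00", line))
# [{'keys': [...], 'attributes': ['cpu_user'], 'cells': ['14.90']}, ...]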
2) Overall Results: The results are illustrated in Tab. VI and Tab. VII, showing that our approach can achieve acceptable accuracy. In particular, the transformer models achieve significantly better accuracy than the Bi-LSTM model. For this case, with the increase in super cell granularity (i.e., a decrease in the number of super cells per target tuple), the accuracy of the Bi-LSTM network improves, while the accuracy of the transformer-based models slightly degrades. The transformer-based models achieve significantly better accuracy, but the computational time required for training (per epoch) and inference is significantly higher. Also, the larger the super cell granularity, the fewer training and testing samples there are; therefore, the time required for training and testing is also significantly reduced as the super cell granularity increases.
TABLE VI
MACHINE LOG DATA INTEGRATION WITH DIFFERENT SUPER CELL GRANULARITY (NUMBER OF SUPER CELLS PER TUPLE OF TARGET DATASET): UBUNTU AND MACOS DATA USED FOR TRAINING, AND ANDROID DATA USED FOR TESTING.

#supercells  model        accuracy  T_train/epoch  T_predict
49           Bi-LSTM      82.1%     8.7 sec        8.1 sec
49           GPT-2 small  99.6%     52.4 sec       21.3 sec
49           BERT base    99.7%     39.1 sec       15.3 sec
25           Bi-LSTM      85.3%     5.4 sec        3.6 sec
25           GPT-2 small  99.4%     29.3 sec       9.8 sec
25           BERT base    99.4%     22.7 sec       7.6 sec
3            Bi-LSTM      91.2%     2.8 sec        1.9 sec
3            GPT-2 small  99.0%     12.3 sec       4.0 sec
3            BERT base    99.1%     13.9 sec       3.7 sec
D. Output Assembling
In this section, we discuss the process of assembling prediction results into tabular files. We mainly measure how the sizes of the source datasets and target datasets and the granularity of super cells affect the overall latency of the assembling process. The results are illustrated in Fig. 5, which show that increasing the super cell granularity significantly reduces the assembling latency. This indicates that if storage space is not the bottleneck, using a transformer-based model and the largest possible super cell granularity will achieve acceptable accuracy while significantly reducing the computational time required for training, inference, and assembling.

TABLE VII
MACHINE LOG DATA INTEGRATION WITH DIFFERENT SUPER CELL GRANULARITY (NUMBER OF SUPER CELLS PER TUPLE OF TARGET DATASET): ONLY UBUNTU DATA USED FOR TRAINING, MACOS AND ANDROID DATA USED FOR TESTING.

#supercells  model        accuracy
49           Bi-LSTM      63.7%
49           GPT-2 small  99.5%
49           BERT base    99.6%
25           Bi-LSTM      71.2%
25           GPT-2 small  99.3%
25           BERT base    99.3%
3            Bi-LSTM      75.6%
3            GPT-2 small  98.7%
3            BERT base    99.1%
Fig. 5. Influence of different super cell granularities on assembling latency, with breakdown analysis (build time: parse and insert the cells involved in each prediction into a dictionary; transform time: transform the dictionary into a Pandas dataframe; write time: write the Pandas dataframe to a CSV file). Panels: (a) 100K-row latency; (b) 100K-row breakdown; (c) 500K-row latency; (d) 500K-row breakdown; (e) 1-million-row latency; (f) 1-million-row breakdown. We use different numbers of attributes (4 to 32) and rows for the target table in each experiment.
VII. RELATED WORKS
A. Handling Schema Evolutions
Schema evolution in relational database, XML, JSON and
ontology has been an active research area for a long time [23],
[63]. One major approach is through model (schema) man-
agement [10], [12] and to automatically generate executable
mapping between the old and evolved schema [51], [77], [80].
While this approach greatly expands the theoretical foundation
of relational schema evolution, it requires application mainte-
nance and may cause undesirable system downtimes [18]. To
address the problem, Prism [18] is proposed to automate the
end-to-end schema modification process by providing DBAs
a schema modification language (SMO) and automatically
rewriting users’ legacy queries. However, Prism requires data
migration to the latest schema for each schema evolution,
which may not be practical for today’s Big Data era. Other
techniques include versioning [40], [54], [67], which avoids
the data migration overhead, but incurs version management
burden and significantly slows down query performance. There
are also abundant works discussing about the schema evolu-
tion problem in NoSQL databases, Polystore or multi-model
databases [33], [35], [53], [70]
Most of these works are mainly targeting at enterprise data
integration problems and require that each source dataset is
managed by a relational or non-relational data store. However
the open data sources widely used by today’s data science
applications are often unmanaged, and thus lack schemas
or metadata information [50]. A deep learning model, once
trained, can handle most schema evolution without any human
intervention, and does not require any data migration, or
version management overhead. Moreover, today’s data science
applications are more tolerant to data errors compared to
traditional enterprise transaction applications, which makes
a deep learning approach promising.
B. Data Discovery
Data discovery aims to find related tables in a data lake. Aurum [26] is an automatic data discovery system that proposes to build an enterprise knowledge graph (EKG) to solve real-world business data integration problems. In the EKG, a node represents
a set of attributes/columns, and an edge connects two similar
nodes. In addition, a hyperedge connects any number of
nodes that are hierarchically related. They propose a two-step
approach to build EKG using LSH-based and TFIDF-based
signatures. They also provide a data discovery query language
SRQL so that users can efficiently query the relationships
among datasets. Aurum [26] mainly targets enterprise data integration. Recently, numerous works have been proposed to address open data discovery problems, including automatically discovering table unionability [59] and joinability [83], [85] based on LSH and similarity measures. Nargesian et al. [58] propose a Markov approach to optimize the navigation organization of a data lake as a DAG, so that the probability of finding a table by any of its attributes is maximized. In this navigation DAG, each node represents a subset of the attributes in the data lake, and an edge represents a navigation transition. All of these works provide helpful insights from an algorithmic perspective and a systems perspective for general data discovery problems. In particular, Fernandez et al. [27] propose a semantic matcher based on word embeddings to discover semantic links in the EKG.
Our work has the potential to integrate data discovery and schema matching into a deep learning model inference process. We argue that in our target scenario, the proposed approach can save significant storage overhead, as we only need to store the data integration models, which are significantly smaller than the EKG, and it can also achieve better performance for wide and sparse tables. We show in this paper that the training data generation and labeling process can be fully automated.
C. Schema/Entity Matching
Traditionally, to solve the data integration problem for data
science applications, once related datasets are discovered, the
programmer will either manually design queries to integrate
these datasets, or leverage a schema matching tool to automat-
ically discover queries to perform the data integration.
There are numerous prior arts in schema matching [29], [38], [51], [72], which mainly match schemas based on metadata (e.g., attribute names) and/or instances. Entity matching (EM) [16], which identifies data instances that refer to the same real-world entity, is also related. Some EM works also employ a deep learning-based approach [24], [37], [42], [49], [57], [73], [82]. Mudgal et al. [57] evaluate and compare the performance of different deep learning models applied to EM on three types of data: structured data, textual data, and dirty data (with missing values, inconsistent attributes, and/or misplaced values). They find that deep learning doesn't outperform existing EM solutions on structured data, but it outperforms them on textual and dirty data.
In addition, to apply schema matching to heterogeneous
data sources, it is important to discover schemas from semi-
structured or non-structured data. We proposed a schema
discovery mechanism for JSON data [78], among other related
works [22], [52].
Our approach proposes a super cell data model to unify open datasets. We train deep learning models to learn the mappings between the data items in source datasets and their positions, as well as aggregation modes, in the target table. If we view the context of a super cell in the source as one entity, and the target position of the super cell as another entity, then the problem we study in this work shares some similarity with the entity matching problem. The distinction is that the equivalence of two "entities" in our problem is determined by the user's data integration logic, whereas the general entity matching problem has no such constraint.
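For intuition, one training example under this view could look roughly like the following sketch; the field names here are hypothetical and simplified relative to the representation actually used in our system.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class SuperCellExample:
    # A super cell value together with its textual context in the
    # source, labeled with its position and aggregation mode in the
    # user's target table.
    value: str
    context: Tuple[str, ...]   # surrounding tokens: headers, keys, etc.
    target_row_key: str        # which row of the target table
    target_column: str         # which column of the target table
    aggregation: str           # e.g., "replace", "sum", "max"

# A confirmed-cases cell from a daily COVID-19 CSV, destined for the
# (region, date) position of the target table:
example = SuperCellExample(
    value="1024",
    context=("Confirmed", "Italy", "2020-03-01"),
    target_row_key="Italy|2020-03-01",
    target_column="confirmed_cases",
    aggregation="sum",
)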
D. Other Related Works
Thirumuruganathan et al. [74] discuss various representations for learning tasks in relational data curation. Cappuzzo et al. [14] further propose an algorithm for obtaining local embeddings using a tripartite-graph-based representation for data integration tasks, such as schema matching and entity matching, on relational databases. We mainly target open data in CSV, JSON, and text formats, and choose a super cell based representation. These works can be leveraged to improve the super cell representation and the corresponding embeddings proposed in this work.
VIII. CONCLUSION
In this work, we propose an end-to-end deep learning approach for the periodic extraction of user-expected tables from fast-evolving open data sources. We further propose a relatively stable super cell based representation to embody the fast-evolving source data, and we train models that are robust to schema changes by automatically injecting schema changes (e.g., dimension pivoting, attribute name changes, attribute addition/removal, key expansion/contraction) into the training data. We formalize the problem and conduct experiments on the integration of open COVID-19 data and machine log data. The results show that our proposed approach achieves acceptable accuracy. In addition, with our approach in place, the system is not easily interrupted by schema changes, and no human intervention is required to handle most of them.
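To illustrate what such perturbation injection can look like in practice, the Python sketch below applies three kinds of schema changes to a list-of-dicts table; the synonym dictionary, the fixed probabilities, and the column names are placeholders rather than the exact ones used in our experiments.

import random

def inject_schema_perturbations(table, seed=42):
    # Hypothetical synonym dictionary; a real injector would derive
    # candidates from observed schema-change patterns.
    rng = random.Random(seed)
    synonyms = {"Longitude": "Long_", "Latitude": "Lat",
                "Province/State": "Province_State"}
    perturbed = [dict(row) for row in table]

    # Attribute name changes: rename columns to known synonyms.
    for old, new in synonyms.items():
        if old in perturbed[0] and rng.random() < 0.5:
            for row in perturbed:
                row[new] = row.pop(old)

    # Attribute addition: insert an irrelevant new column.
    if rng.random() < 0.5:
        for row in perturbed:
            row["FIPS"] = ""

    # Attribute removal: drop one random non-key column.
    if rng.random() < 0.5:
        drop = rng.choice([c for c in perturbed[0] if c != "Country/Region"])
        for row in perturbed:
            row.pop(drop, None)
    return perturbed

rows = [{"Country/Region": "Italy", "Longitude": "12.5", "Latitude": "41.9"}]
print(inject_schema_perturbations(rows, seed=7))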
REFERENCES
[1] Caltech COVID-19 modeling. https://github.com/quantummind/caltech_covid_19_modeling.
[2] CDC COVID-19 death forecasting models. https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html.
[3] Community mobility reports. https://www.google.com/covid19/mobility/.
[4] COVID-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. https://github.com/CSSEGISandData/COVID-19.
[5] fastText. https://fasttext.cc/.
[6] Harvard COVID-19 data: county age&sex with ANN. https://dataverse.harvard.edu.
[7] National Oceanic and Atmospheric Administration. www.noaa.gov.
[8] D. Abadi, A. Ailamaki, D. Andersen, P. Bailis, M. Balazinska, P. Bern-
stein, P. Boncz, S. Chaudhuri, A. Cheung, A. Doan, et al. The seattle
report on database research. ACM SIGMOD Record, 48(4):44–53, 2020.
[9] Y. An, A. Borgida, R. J. Miller, and J. Mylopoulos. A semantic
approach to discovering schema mapping expressions. In 2007 IEEE
23rd International Conference on Data Engineering, pages 206–215.
IEEE, 2007.
[10] P. A. Bernstein. Applying model management to classical meta data
problems. In CIDR, volume 2003, pages 209–220. Citeseer, 2003.
[11] P. A. Bernstein and S. Melnik. Model management 2.0: manipulating
richer mappings. In Proceedings of the 2007 ACM SIGMOD interna-
tional conference on Management of data, pages 1–12, 2007.
[12] P. A. Bernstein and E. Rahm. Data warehouse scenarios for model
management. In International Conference on Conceptual Modeling,
pages 1–15. Springer, 2000.
[13] H. A. Campbell, F. Urbano, S. Davidson, H. Dettki, and F. Cagnacci. A
plea for standards in reporting data collected by animal-borne electronic
devices. Animal Biotelemetry, 4(1):1, 2016.
[14] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Creating embed-
dings of heterogeneous relational datasets for data integration tasks. In
Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data, pages 1335–1349, 2020.
[15] L. Chen, H. Esfandiari, G. Fu, and V. Mirrokni. Locality-sensitive
hashing for f-divergences: Mutual information loss and beyond. In
Advances in Neural Information Processing Systems, pages 10044–
10054, 2019.
[16] P. Christen. Data matching: concepts and techniques for record linkage,
entity resolution, and duplicate detection. Springer Science & Business
Media, 2012.
[17] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Automating the
database schema evolution process. The VLDB Journal, 22(1):73–98,
2013.
[18] C. A. Curino, H. J. Moon, and C. Zaniolo. Graceful database schema
evolution: the prism workbench. Proceedings of the VLDB Endowment,
1(1):761–772, 2008.
[19] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive
hashing scheme based on p-stable distributions. In Proceedings of the
twentieth annual symposium on Computational geometry, pages 253–
262, 2004.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training
of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
[21] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang,
B. Chandramouli, J. Gehrke, D. Kossmann, et al. Alex: an updatable
adaptive learned index. In Proceedings of the 2020 ACM SIGMOD
International Conference on Management of Data, pages 969–984, 2020.
[22] M. DiScala and D. J. Abadi. Automatic generation of normalized
relational schemas from nested key-value data. In Proceedings of the
2016 International Conference on Management of Data, pages 295–310,
2016.
[23] A. Doan and A. Y. Halevy. Semantic integration research in the database
community: A brief survey. AI magazine, 26(1):83–83, 2005.
[24] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang.
Deeper–deep entity resolution. arXiv preprint arXiv:1710.00597, 2017.
[25] R. Fagin, L. M. Haas, M. Hernández, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema mapping creation and data exchange. In Conceptual modeling: foundations and applications, pages 198–236. Springer, 2009.
[26] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and
M. Stonebraker. Aurum: A data discovery system. In 2018 IEEE 34th
International Conference on Data Engineering (ICDE), pages 1001–
1012. IEEE, 2018.
[27] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. Elmagarmid, I. Ilyas,
S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. Seeping
semantics: Linking datasets using word embeddings for data discovery.
In 2018 IEEE 34th International Conference on Data Engineering
(ICDE), pages 989–1000. IEEE, 2018.
[28] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Lavi-
olette, M. Marchand, and V. Lempitsky. Domain-adversarial training
of neural networks. The Journal of Machine Learning Research,
17(1):2096–2030, 2016.
[29] G. Gottlob and P. Senellart. Schema mapping discovery from data
instances. Journal of the ACM (JACM), 57(2):1–37, 2010.
[30] M. A. Hernández, R. J. Miller, and L. M. Haas. Clio: A semi-automatic tool for schema mapping. ACM SIGMOD Record, 30(2):607, 2001.
[31] K. Herrmann, H. Voigt, A. Behrend, and W. Lehner. Codel–a relationally
complete language for database evolution. In East European Conference
on Advances in Databases and Information Systems, pages 63–76.
Springer, 2015.
[32] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, and W. Lehner. Living
in parallel realities: Co-existing schema versions with a bidirectional
database evolution language. In Proceedings of the 2017 ACM Interna-
tional Conference on Management of Data, pages 1101–1116, 2017.
[33] A. Hillenbrand, M. Levchenko, U. Störl, S. Scherzinger, and M. Klettke. Migcast: Putting a price tag on data model evolution in nosql data stores. In Proceedings of the 2019 International Conference on Management of Data, pages 1925–1928, 2019.
[34] B. Hilprecht, C. Binnig, and U. Röhm. Learning a partitioning advisor for cloud databases. In D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo, editors, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pages 143–157. ACM, 2020.
[35] I. Holubová, M. Klettke, and U. Störl. Evolution management of multi-model data. In Heterogeneous Data Management, Polystores, and Analytics for Healthcare, pages 139–153. Springer, 2019.
[36] P. Howard. Data migration. Bloor Research, London, UK, 2011.
[37] J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa. Low-resource
deep entity resolution with transfer and active learning. arXiv preprint
arXiv:1906.08042, 2019.
[38] A. Kimmig, A. Memory, R. J. Miller, and L. Getoor. A collective,
probabilistic approach to schema mapping using diverse noisy evidence.
IEEE Transactions on Knowledge and Data Engineering, 31(8):1426–
1439, 2018.
[39] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. Learned
cardinalities: Estimating correlated joins with deep learning. arXiv
preprint arXiv:1809.00677, 2018.
[40] M. C. Klein and D. Fensel. Ontology versioning on the semantic web.
In SWWS, pages 75–91, 2001.
[41] M. Klettke. Evolution management of multi-model data (position paper).
[42] P. Konda, S. Das, P. Suganthan GC, A. Doan, A. Ardalan, J. R. Ballard,
H. Li, F. Panahi, H. Zhang, J. Naughton, et al. Magellan: Toward
building entity matching management systems. Proceedings of the VLDB
Endowment, 9(12):1197–1208, 2016.
[43] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case
for learned index structures. In Proceedings of the 2018 International
Conference on Management of Data, pages 489–504, 2018.
[44] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning
to optimize join queries with deep reinforcement learning. arXiv preprint
arXiv:1808.03196, 2018.
[45] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning
at scale. arXiv preprint arXiv:1611.01236, 2016.
[46] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[47] G. Li, X. Zhou, S. Li, and B. Gao. Qtune: A query-aware database
tuning system with deep reinforcement learning. Proceedings of the
VLDB Endowment, 12(12):2118–2130, 2019.
[48] X.-L. Mao, B.-S. Feng, Y.-J. Hao, L. Nie, H. Huang, and G. Wen. S2jsd-
lsh: A locality-sensitive hashing schema for probability distributions. In
Proceedings of the Thirty-First AAAI Conference on Artificial Intelli-
gence, pages 3244–3251, 2017.
[49] V. V. Meduri, L. Popa, P. Sen, and M. Sarwat. A comprehensive
benchmark framework for active learning methods in entity matching.
In Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data, pages 1133–1147, 2020.
[50] R. J. Miller. Open data integration. Proceedings of the VLDB
Endowment, 11(12):2130–2139, 2018.
[51] R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, volume 2000, pages 77–88, 2000.
[52] M. J. Mior, K. Salem, A. Aboulnaga, and R. Liu. Nose: Schema design
for nosql applications. IEEE Transactions on Knowledge and Data
Engineering, 29(10):2275–2289, 2017.
[53] M. L. Möller, M. Klettke, A. Hillenbrand, and U. Störl. Query rewriting for continuously evolving nosql databases. In International Conference on Conceptual Modeling, pages 213–221. Springer, 2019.
[54] H. J. Moon, C. A. Curino, A. Deutsch, C.-Y. Hou, and C. Zaniolo. Man-
aging and querying transaction-time databases under schema evolution.
Proceedings of the VLDB Endowment, 1(1):882–895, 2008.
[55] H. J. Moon, C. A. Curino, M. Ham, and C. Zaniolo. Prima: archiving
and querying historical data with evolving schemas. In Proceedings of
the 2009 ACM SIGMOD International Conference on Management of
data, pages 1019–1022, 2009.
[56] H. J. Moon, C. A. Curino, and C. Zaniolo. Scalable architecture and query optimization for transaction-time dbs with evolving schemas. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 207–218, 2010.
[57] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep,
E. Arcaute, and V. Raghavendra. Deep learning for entity matching:
A design space exploration. In Proceedings of the 2018 International
Conference on Management of Data, pages 19–34, 2018.
[58] F. Nargesian, K. Q. Pu, E. Zhu, B. Ghadiri Bashardoost, and R. J.
Miller. Organizing data lakes for navigation. In Proceedings of the
2020 ACM SIGMOD International Conference on Management of Data,
pages 1939–1950, 2020.
[59] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller. Table union search on
open data. Proceedings of the VLDB Endowment, 11(7):813–825, 2018.
[60] S. Palkar, J. J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe, and M. Zaharia. Weld: A common runtime for high performance data analytics. In Conference on Innovative Data Systems Research (CIDR), page 45, 2017.
[61] L. Popa, M. A. Hernandez, Y. Velegrakis, R. J. Miller, F. Naumann, and
H. Ho. Mapping xml and relational schemas with clio. In Proceedings
18th International Conference on Data Engineering, pages 498–499.
IEEE, 2002.
[62] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever.
Language models are unsupervised multitask learners. OpenAI Blog,
1(8):9, 2019.
[63] E. Rahm and P. A. Bernstein. An online bibliography on schema evolution. ACM SIGMOD Record, 35(4):30–31, 2006.
[64] S. Scherzinger, M. Klettke, and U. Störl. Managing schema evolution in nosql data stores. arXiv preprint arXiv:1308.0514, 2013.
[65] J. Schmidhuber. Deep learning in neural networks: An overview. Neural
networks, 61:85–117, 2015.
[66] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik. Dis-
covering queries based on example tuples. In Proceedings of the 2014
ACM SIGMOD international conference on Management of data, pages
493–504, 2014.
[67] Y. Sheng. Non-blocking Lazy Schema Changes in Multi-Version
Database Management Systems. PhD thesis, Carnegie Mellon University
Pittsburgh, PA, 2019.
[68] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb.
Learning from simulated and unsupervised images through adversarial
training. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2107–2116, 2017.
[69] M. Stonebraker and I. F. Ilyas. Data integration: The current status and
the way forward. IEEE Data Eng. Bull., 41(2):3–9, 2018.
[70] U. Störl, M. Klettke, and S. Scherzinger. Nosql schema evolution and data migration: State-of-the-art and opportunities. In EDBT, pages 655–658, 2020.
[71] D. G. Sullivan, M. I. Seltzer, and A. Pfeffer. Using probabilistic
reasoning to automate software tuning. ACM SIGMETRICS Performance
Evaluation Review, 32(1):404–405, 2004.
[72] B. Ten Cate, P. G. Kolaitis, and W.-C. Tan. Schema mappings and
data examples. In Proceedings of the 16th International Conference on
Extending Database Technology, pages 777–780, 2013.
[73] S. Thirumuruganathan, S. A. P. Parambath, M. Ouzzani, N. Tang, and
S. Joty. Reuse and adaptation for entity resolution through transfer
learning. arXiv preprint arXiv:1809.11084, 2018.
[74] S. Thirumuruganathan, N. Tang, M. Ouzzani, and A. Doan. Data
curation with deep learning [vision]. arXiv preprint arXiv:1803.01384,
2018.
[75] A. Tsymbal. The problem of concept drift: definitions and related work.
Computer Science Department, Trinity College Dublin, 106(2):58, 2004.
[76] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database
management system tuning through large-scale machine learning. In
Proceedings of the 2017 ACM International Conference on Management
of Data, pages 1009–1024, 2017.
[77] Y. Velegrakis, R. J. Miller, and L. Popa. Preserving mapping consistency
under schema changes. The VLDB Journal, 13(3):274–293, 2004.
[78] L. Wang, S. Zhang, J. Shi, L. Jiao, O. Hassanzadeh, J. Zou, and C. Wang. Schema management for document stores. Proceedings of the VLDB Endowment, 8(9):922–933, 2015.
[79] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for
similarity joins with edit distance constraints. Proceedings of the VLDB
Endowment, 1(1):933–944, 2008.
[80] C. Yu and L. Popa. Semantic adaptation of schema mappings when
schemas evolve. In Proceedings of the 31st international conference on
Very large data bases, pages 1006–1017. VLDB Endowment, 2005.
[81] J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang,
T. Cheng, L. Liu, et al. An end-to-end automatic cloud database tuning
system using deep reinforcement learning. In Proceedings of the 2019
International Conference on Management of Data, pages 415–432, 2019.
[82] C. Zhao and Y. He. Auto-em: End-to-end fuzzy entity-matching using
pre-trained deep models and transfer learning. In The World Wide Web
Conference, pages 2413–2424, 2019.
[83] E. Zhu, D. Deng, F. Nargesian, and R. J. Miller. Josie: Overlap set
similarity search for finding joinable tables in data lakes. In Proceedings
of the 2019 International Conference on Management of Data, pages
847–864, 2019.
[84] E. Zhu, Y. He, and S. Chaudhuri. Auto-join: Joining tables by leveraging
transformations. Proceedings of the VLDB Endowment, 10(10):1034–
1045, 2017.
[85] E. Zhu, F. Nargesian, K. Q. Pu, and R. J. Miller. Lsh ensemble: Internet-
scale domain search. arXiv preprint arXiv:1603.07410, 2016.
[86] J. Zou, P. Barhate, A. Das, A. Iyengar, B. Yuan, D. Jankov, and
C. Jermaine. Lachesis: Automated generation of persistent partitionings
for big data applications. arXiv preprint arXiv:2006.16529, 2020.