Aspects of Integrating Heterogeneous and Inconsistent Data

As a result of the unrelenting proliferation of sources and types of information, integrating heterogeneous data is a major current challenge that manifests itself across a wide spectrum, from government and business to science and health care. This challenge encompasses a collection of inter-related problems that include extracting and cleaning data from the sources, deriving a unified format for the integrated data, transforming and materializing data from the sources into data conforming with the unified format, and answering queries over the unified format. A critical task underlying these problems is the derivation of the relationship between different database schemas. During the past decade, schema mappings have emerged as the right tool for this task. Schema mappings are high-level, declarative specifications of the relationships between two database schemas; they provide the “right” level of abstraction and, at the same time, can be automatically compiled into executable code. The main aim of this project is to advance the state of the art in the design of schema mappings and to investigate their uses in integrating and exchanging inconsistent data.

The intellectual merit of this proposal is twofold: (i) the development of a framework and tools for schema-mapping design based on data examples; (ii) the development of a methodology and tools for integrating and exchanging inconsistent data.

At present, most schema-mapping design systems adhere to the same general methodology: first, a visual specification of the relationship between elements of two schemas is solicited from the user and then a schema mapping is automatically derived from the visual specification. Since several pairwise logically inequivalent schema mappings may be consistent with a given visual specification, different systems may produce different schema mappings from the same visual specification. Thus, the user is still required to understand the formal specification in order to verify the correctness of the derived schema mapping. Recently, a new and alternative approach to schema-mapping design has emerged. This approach is centered around the systematic use of data examples. In particular, data examples can be used both as a device to illustrate and understand the behavior of already derived schema mappings and as input to a schema-mapping design system that will derive a suitable schema mapping based on the input data examples. This project will systematically explore fundamental algorithmic tasks underlying these two uses of data examples in schema-mapping design.
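To make the role of data examples concrete, the following is a minimal sketch and not the project's actual tooling. It assumes hypothetical schemas (source relations Emp and Dept, target relation Staff) and two hypothetical candidate mappings that could both be consistent with the same visual specification. "Fitting" is simplified here to mean that the mapping's output equals the expected target instance.

```python
# Hypothetical schemas, chosen only for illustration:
#   source: Emp(name, dept_id), Dept(dept_id, dname)
#   target: Staff(name, dname)

def mapping_join(emp, dept):
    """Emp(n, d) AND Dept(d, m) -> Staff(n, m): pair each employee with their own department."""
    return {(n, m) for (n, d) in emp for (d2, m) in dept if d == d2}

def mapping_product(emp, dept):
    """Emp(n, d) AND Dept(e, m) -> Staff(n, m): pair every employee with every department."""
    return {(n, m) for (n, _) in emp for (_, m) in dept}

def fits(mapping, example):
    """Simplified notion of fitting: the mapping reproduces the expected target exactly."""
    (emp, dept), expected = example
    return mapping(emp, dept) == expected

# A data example: a source instance paired with the target instance the user expects.
emp = {("alice", 1), ("bob", 2)}
dept = {(1, "Sales"), (2, "HR")}
example = ((emp, dept), {("alice", "Sales"), ("bob", "HR")})

for candidate in (mapping_join, mapping_product):
    print(candidate.__name__, fits(candidate, example))
```

In this sketch a single data example is enough to tell the two candidates apart, without the user having to read either formal specification.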

Once schema mappings have been obtained between one or more source schemas and a target schema, data exchange systems enable the translation of data structured under the source schemas into data structured under the target schema. Here, however, a serious problem often arises: the source data is frequently unreliable. In particular, sources may contain inconsistencies or may give rise to inconsistencies when combined. The current framework of data exchange does not handle inconsistencies well. Indeed, when data sources contain or give rise to inconsistencies, established data exchange methods fail to materialize a target instance; furthermore, the semantics of query answering yields undesirable results. This project will re-examine and extend the data exchange framework in order to handle inconsistencies gracefully.
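The failure to materialize a target instance can be illustrated with a small sketch. It assumes a hypothetical source relation Emp(name, dept), a hypothetical target relation Works(name, dept) with a key constraint on name, and a copy mapping between them. The names `exchange` and `ChaseFailure` are illustrative only and do not refer to any existing system.

```python
# Hypothetical setting, chosen only for illustration:
#   source relation: Emp(name, dept)
#   target relation: Works(name, dept), where name is a key (one dept per name)

class ChaseFailure(Exception):
    """Raised when the target key would force two distinct constants to be equal."""

def exchange(emp):
    """Copy Emp into Works, then enforce the key constraint on name."""
    works = {}
    for name, dept in sorted(emp):
        if name in works and works[name] != dept:
            # Two different departments for the same name: no target instance exists.
            raise ChaseFailure(f"{name!r} maps to both {works[name]!r} and {dept!r}")
        works[name] = dept
    return set(works.items())

consistent = {("alice", "Sales"), ("bob", "HR")}
inconsistent = consistent | {("alice", "HR")}

print(exchange(consistent))
try:
    exchange(inconsistent)
except ChaseFailure as err:
    print("no target instance:", err)
```

A single conflicting source fact makes the whole exchange fail in this sketch, even though the fact about bob is unaffected by the conflict. This all-or-nothing behavior is the kind of outcome the extended framework is meant to avoid.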

The broader impact of this project is the development of human resources in science and engineering through the teaching, mentoring, and research training of graduate students. Results of this project will be incorporated into an advanced database topics course that will be developed and taught at UC Santa Cruz. All publications, course material, and software prototypes and tools developed in this project will be made publicly available on the Web for broad dissemination and sharing. Additional dissemination efforts will include tutorials at conferences and workshops, and short courses in advanced schools for graduate students and postdoctoral researchers.

This research project is funded by the National Science Foundation under NSF grant 1217869, and is being carried out at the University of California, Santa Cruz. The research team is composed as follows:

Participants:

Phokion Kolaitis (principal investigator)
Balder ten Cate (co-principal investigator)
Richard Halpert (graduate student researcher)
Kun Qian (graduate student researcher)

External collaborators:

Samson Abramsky, Vince Barany, Michael Benedikt, Meghyn Bienvenu, Douglas Burdick, Cristina Civili, Thomas Colcombet, Víctor Dalmau, Ronald Fagin, Enrico Franconi, Georg Gottlob, Lauri Hella, Benny Kimelfeld, Carsten Lutz, Dan Olteanu, Walied Othman, Reinhard Pichler, Lucian Popa, Emanuel Sallinger, Vadim Savenkov, Luc Segoufin, Inanç Seylan, Evgeny Sherkhonov, Francesca Spezzano, Wang-Chiew Tan, Efthymia Tsamoura, Zografoula Vagena, Michael Vanden Boom, Frank Wolter