Large-scale extraction and integration of data from Web sources

Flint is a system developed at the DB group of Università Roma Tre. Our ultimate goal is a system that, given a small set of samples for an entity of interest, automatically retrieves sources, extracts data, and integrates the information offered by these sources.

Overview

Due to the proliferation of Web publishing tools, we are witnessing a continuous growth in the number of Web sources that provide detailed information in many disparate domains, ranging from finance to sports, from bibliographies to real estate. Within a single domain, e.g. sports, the Web now offers a huge number of sources that publish data about specific domain types (e.g. players, teams, etc.). Each source organizes the published data into unstructured documents, which usually consist of HTML pages built according to a source-specific template.

Information about the instances of a given type is therefore scattered across a myriad of sources. However, although each source usually provides some information of its own, it is interesting to observe that there is also a high degree of information redundancy on the Web, at both the intensional and the extensional levels:

• at the extensional level, many instances of the same type occur in a large number of sources. Consider the sports domain: many Web sites publish information about a common set of soccer players;

• at the intensional level, different sources provide a core set of attributes about the same type. Following the sports example, a large majority of soccer Web sites describe players with data such as height, weight, and birth date. By contrast, some attributes appear only in a limited yet significant number of sources. Again from our example, some sources also provide the national team or the birth place of the described soccer players.

The above observations suggest an interesting perspective about the information that is available on the Web. Let us refer again to the soccer player example: we can imagine that there exists a hidden relation with information about soccer players, whose data are actually scattered across sources over the Web. Each source can be seen as an encoding (in HTML format) of a partial view (projection and selection) over such a relation.
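The idea of a source as a partial view can be made concrete with a small sketch. The Python below is only an illustration under strong assumptions, not a description of how Flint works: the sample rows are invented, the helper names `source_view` and `materialize` are hypothetical, the values are assumed to be already extracted from the HTML pages, and the player name is assumed to be a reliable key shared by all sources.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]

# Hypothetical hidden relation about soccer players (invented sample data).
HIDDEN_RELATION: List[Row] = [
    {"name": "Player A", "height": "180 cm", "weight": "75 kg",
     "birth_date": "1990-01-01", "national_team": "Italy"},
    {"name": "Player B", "height": "185 cm", "weight": "80 kg",
     "birth_date": "1992-05-10", "national_team": "Italy"},
    {"name": "Player C", "height": "178 cm", "weight": "72 kg",
     "birth_date": "1988-11-23", "national_team": "Spain"},
]


def source_view(relation: List[Row],
                attributes: List[str],
                predicate: Callable[[Row], bool]) -> List[Row]:
    """Model a source: selection (rows passing the predicate)
    followed by projection (only the listed attributes)."""
    return [{a: row[a] for a in attributes}
            for row in relation if predicate(row)]


def materialize(views: List[List[Row]], key: str = "name") -> Dict[str, Row]:
    """Merge partial views on the assumed shared key.
    Conflicting values are not handled: the last view simply wins."""
    merged: Dict[str, Row] = {}
    for view in views:
        for row in view:
            merged.setdefault(row[key], {}).update(row)
    return merged


# Source 1 publishes the core attributes for every player.
site_1 = source_view(HIDDEN_RELATION,
                     ["name", "height", "weight"],
                     lambda r: True)

# Source 2 publishes rarer attributes, and only for one national team.
site_2 = source_view(HIDDEN_RELATION,
                     ["name", "birth_date", "national_team"],
                     lambda r: r["national_team"] == "Italy")

recovered = materialize([site_1, site_2])
```

In this toy setting, the players covered by both sources are recovered in full, while "Player C" keeps only the attributes published by the first source. On the real Web, neither a shared key nor clean, conflict-free values can be taken for granted.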

From this perspective, we aim to study methods and techniques for materializing the hidden relation spread throughout the myriad of sources that publish data of the same type. We believe that this is an interesting problem that raises several issues whose difficulties are exacerbated by the scale of the Web.