XML Schema Matching

Overview of the project

Comparing schemas to find matches is a major part of schema and data integration, and of any application that merges information from more than one XML data source. It is important for information cooperation, data warehousing, e-commerce and scientific applications. In practice, schema matching is done manually with the help of graphical user interfaces, which is a labour-intensive process. As the number of online information sources increases rapidly, we need better ways of merging and summarising information from multiple heterogeneous sources. Hence, improved schema matching algorithms are increasingly important. All the datasets we collected for our experiments are available for research purposes.

The descriptions of the datasets contain either links to sets which we have not gathered ourselves, or downloadable zipped files containing the schemas we have collected.

Experimental Datasets

Small schemas

We have used the collection of XML schemas originally developed by AnHai Doan and his colleagues for the LSD project at Illinois. The schemas in this collection are approximately 3-5 levels deep and contain 15-20 distinct components. The collection covers:

Medium schemas

Our medium collection contains XML schemas related to the chemical industry, which we collected from the CIDX website. All schemas in this collection concern trade, in particular descriptions of purchases, purchase orders, purchase processing, etc. They are approximately 4-6 levels deep and contain 80-100 distinct components.
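The size figures quoted for each collection (levels deep, distinct components) can be approximated for any schema with a short script. The following is only a minimal sketch, not the procedure we used: it assumes that a "component" is a named element or attribute declaration and that depth is the nesting of such declarations within a single file. It does not follow type references, ref attributes, imports or includes, so it may under-count on schemas that rely on them. The function name schema_stats and the sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for XML Schema elements.
XS = "{http://www.w3.org/2001/XMLSchema}"


def schema_stats(xsd_text):
    """Return (max nesting depth, number of distinct component names).

    Assumption: components are xs:element and xs:attribute declarations,
    identified by their name (or ref) attribute.
    """
    root = ET.fromstring(xsd_text)
    names = set()

    def walk(node, depth):
        # Only element and attribute declarations add a level.
        if node.tag in (XS + "element", XS + "attribute"):
            depth += 1
            name = node.get("name") or node.get("ref")
            if name:
                names.add(name)
        deepest = depth
        for child in node:
            deepest = max(deepest, walk(child, depth))
        return deepest

    return walk(root, 0), len(names)


# Hypothetical schema used only to illustrate the counting.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
            <xs:attribute name="sku" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="id" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

if __name__ == "__main__":
    print(schema_stats(SAMPLE))  # (3, 5)
```

For the sample, the deepest path is order, item, price (three levels), and the five distinct names are order, item, price, sku and id.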

Medium Collection (downloadable zipped)

Trade

Large schemas

The XML Schemas in this collection are at least 5 levels deep and contain over 100 distinct components. They are partitioned into three groups.
The first group includes schemas for geographical and related standards:

XML schemas in the second group have been converted from DTDs that were harvested from the Internet between October and December 2004.
The final group also has XML schemas harvested from the Internet. They are not converted from DTDs.

Large Collection (downloadable zipped)

Meta Standard schemas

Documentation

Harvested from the Internet1

Harvested from the Internet2

Artificial schemas

The collection of artificial schemas contains XML Schemas created either by the W3C for well-formedness testing or by us for algorithm testing (e.g., testing the parsing tool, testing matching with references).
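As a rough illustration of what a well-formedness test involves, the sketch below uses Python's standard XML parser to report whether a document parses at all. It checks well-formedness only, not validity against the XML Schema specification, and it is not the test harness used by the W3C or by us. The function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejects mismatched tags, unclosed elements, etc.
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<a><b/></a>"))   # True
    print(is_well_formed("<a><b></a>"))    # False: <b> is never closed
```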

Artificial Collection (downloadable zipped)

Tools

The tool we used to convert DTDs to XML Schemas is available for download:
Windows version

Publications

Tran Hong-Minh, Dan Smith (2006) Machine Learning Models: Combining Evidence of Similarity for XML Schema Matching. KDXD 2006, 43-53 DOI: 10.1007/11730262_7
Tran Hong-Minh, Dan Smith (2006) Word Similarity in WordNet. HPSC 2006, Hanoi (PDF 180kb)

Contact us

Dr. Dan Smith
Minh Tran