Overview of the project
Comparing schemas to find matches is a major part of schema and data
integration, and of any application that merges information from more than one XML data source.
It is important for information cooperation, data warehousing, e-commerce
and scientific applications. In practice, schema matching is done manually with
the help of graphical user interfaces, which is a labour-intensive process. As
the number of online information sources increases rapidly, we need
better ways of merging and summarising information from multiple
heterogeneous sources. Hence, improved schema matching algorithms are
increasingly important. The datasets we collected for our experiments are
all available for research purposes.
The description of each dataset contains
either a link to a set which we have not gathered ourselves,
or a downloadable zipped file containing the schemas we have collected.
Experimental Datasets
Small schemas
We have used the collection of XML schemas originally developed
by AnHai Doan and his colleagues for the
LSD project at Illinois. The schemas in this collection
are approximately 3-5 levels deep and contain 15-20 distinct components.
The collection covers:
Medium schemas
Our medium collection contains XML schemas related to the chemical industry, which
we collected from the CIDX website. All
schemas in this collection concern trade, in particular
descriptions of purchases, purchase orders, purchase processing, etc. They
are approximately 4-6 levels deep and contain 80-100 distinct components.
Medium Collection (downloadable zipped): Trade
Large schemas
The XML Schemas in this collection are at least 5 levels deep and contain over 100
distinct components. They are partitioned into three groups.
The first group includes schemas for geographical and related standards:
XML schemas in the second group were converted from DTDs which
were harvested from the Internet between October and December 2004.
The final group also contains XML schemas harvested from the Internet,
but these were not converted from DTDs.
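The depth and component figures quoted for the collections above can be estimated for any schema with a short script. The following is only a minimal sketch and not the procedure used to produce those figures: it assumes that "depth" means the nesting depth of inline xs:element declarations and that a "component" is a distinctly named element or attribute, and it does not resolve references to named types or imported schemas. The function name schema_stats and the sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for XML Schema elements.
XS = "{http://www.w3.org/2001/XMLSchema}"


def schema_stats(xsd_text):
    """Return (depth, components) for an XML Schema given as a string.

    Assumptions (ours, for illustration only): depth is the deepest
    nesting of inline xs:element declarations; components are the
    distinct names of xs:element and xs:attribute declarations.
    """
    root = ET.fromstring(xsd_text)
    names = set()

    def walk(node, depth):
        # Only xs:element declarations add a level of depth.
        if node.tag == XS + "element":
            depth += 1
        # Elements and attributes both count as components.
        if node.tag in (XS + "element", XS + "attribute"):
            name = node.get("name") or node.get("ref")
            if name:
                names.add(name)
        deepest = depth
        for child in node:
            deepest = max(deepest, walk(child, depth))
        return deepest

    return walk(root, 0), len(names)


# Hypothetical sample schema, not taken from any of the collections.
EXAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
            <xs:attribute name="sku" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

depth, components = schema_stats(EXAMPLE)
print(depth, components)  # prints "3 4"
```

Because named types and element references are not followed, schemas that rely heavily on them would appear shallower under this sketch than they effectively are.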
Artificial schemas
The collection of artificial schemas contains XML Schemas created either by the W3C
for well-formedness testing or by us for algorithm testing
(e.g., testing the parsing tool, testing matching with references, etc.).
Tools
The tool we used to convert DTDs to Schemas is available for download:
Windows version
Publications
Tran Hong-Minh, Dan Smith (2006) Machine Learning Models: Combining Evidence of Similarity
for XML Schema Matching.
KDXD 2006, 43-53
DOI: 10.1007/11730262_7
Tran Hong-Minh, Dan Smith (2006) Word Similarity in WordNet,
HPSC 2006, Hanoi
(PDF 180kb)
Contact us
Dr. Dan Smith
Minh Tran