KDD Projects at UEA
Methodology
- UEA's KDD Roadmap
The rapid expansion of all areas within the KDD community has made it increasingly difficult for KDD engineers to keep track of the techniques available to solve a particular task. The methodology devised at UEA aims to collate the wealth of expert knowledge held by data miners and present it in such a way that it can easily be applied when designing new projects or reviewing existing ones. The stages of the KDD process are expressed in the form of a roadmap with a feedback loop. The iterative nature of the process allows results from one stage to be improved by returning to previous stages and making minor adjustments to some of the parameters and decisions.
An important analogy can be drawn between the KDD process and well-established software engineering processes. Both share this iterative nature and emphasise the importance of their early stages. It is well known in software engineering that at least sixty percent of a project's effort should be allocated to the analysis and design of the software; if this is carried out correctly, it should reduce the work involved in the programming and testing stages. The early stages of the KDD process carry the same weight: if the project is carefully analysed, specified and managed, the decisions to be made at later stages will be clearer and it will be easier to identify the critical stages of the project.
Techniques
- Exact Methods / All Rules Search
All rules search is the search of a database for all patterns that satisfy defined constraints; these patterns are expressed as rules. The main problem with all rules search is that the search space is very large and possibly infinite. Methods are being developed that can find the patterns of interest in reasonable time.
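To illustrate the idea (a minimal sketch with invented data, not the group's own algorithm), the following Python fragment enumerates one- and two-condition rules over a small categorical dataset and keeps only those satisfying user-defined support and confidence constraints; even at this scale the combinatorial growth of the search space is apparent.

    from itertools import combinations

    def all_rules(records, target, min_support=0.1, min_confidence=0.8):
        """Naive all-rules search: enumerate conjunctions of attribute=value
        conditions and keep rules meeting the support/confidence constraints."""
        n = len(records)
        # Candidate conditions: every (attribute, value) pair except the target.
        conditions = {(a, v) for r in records for a, v in r.items() if a != target}
        rules = []
        for size in (1, 2):  # bound the antecedent length to keep the search finite
            for antecedent in combinations(sorted(conditions), size):
                covered = [r for r in records
                           if all(r.get(a) == v for a, v in antecedent)]
                if len(covered) / n < min_support:
                    continue  # fails the support constraint
                for cls in {r[target] for r in covered}:
                    correct = sum(1 for r in covered if r[target] == cls)
                    if correct / len(covered) >= min_confidence:
                        rules.append((antecedent, cls,
                                      len(covered) / n, correct / len(covered)))
        return rules

    records = [
        {"pressure": "low", "temp": "cold", "rain": "yes"},
        {"pressure": "low", "temp": "mild", "rain": "yes"},
        {"pressure": "high", "temp": "mild", "rain": "no"},
        {"pressure": "high", "temp": "cold", "rain": "no"},
    ]
    for rule in all_rules(records, target="rain"):
        print(rule)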
- Missing Data
The work extends the group's original definitions of accuracy and coverage to include missing and unreliable data, and explains how our rule extraction algorithm has been modified to use these redefinitions. As a case study, the work examined a meteorological application of knowledge discovery and identified a number of objectives. The formulation of meteorological problems can yield extremely wide databases, rife with missing values and unreliable data. Feature selection was applied to remove irrelevant fields from the database, creating a problem of workable proportions for later stages of the KDD process. Simulated annealing was then used to find patterns in the database. Throughout the project, the emphasis remained on the handling of missing and unreliable values at each stage and on how they can affect the entire KDD process.
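One possible way of redefining accuracy and coverage in the presence of missing values, sketched below in Python, is to leave records with a missing antecedent value out of both the numerator and the denominator, so that they neither confirm nor refute a rule. The treatment and all names here are illustrative assumptions; the group's actual redefinitions may differ.

    def accuracy_and_coverage(records, antecedent, consequent, missing=None):
        """Accuracy/coverage where records with a missing antecedent value are
        neither counted as matches nor held against the rule (one possible
        treatment; the group's own redefinitions may differ)."""
        applicable = matched = correct = 0
        for r in records:
            values = [r.get(a) for a, _ in antecedent]
            if any(v is missing for v in values):
                continue  # missing value: record cannot confirm or refute the rule
            applicable += 1
            if all(r.get(a) == v for a, v in antecedent):
                matched += 1
                if r.get(consequent[0]) == consequent[1]:
                    correct += 1
        coverage = matched / applicable if applicable else 0.0
        accuracy = correct / matched if matched else 0.0
        return accuracy, coverage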
- Generalised Linear Modelling
The project involves working with, and understanding, generalised linear modelling, as well as research into methods of improving the models generated by current software packages. Part of the work will be carried out with a major insurance company.
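As an illustration of the kind of model involved (the data and variables below are invented, and the packages named are only one possible choice), claim frequency in insurance is commonly modelled as a Poisson GLM with a log link and an exposure offset:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Invented policy data: claim counts modelled as Poisson with a log link,
    # with exposure (policy-years) as an offset -- a standard GLM formulation
    # for insurance claim frequency.
    rng = np.random.default_rng(0)
    data = pd.DataFrame({
        "claims": rng.poisson(0.2, 1000),
        "driver_age": rng.integers(18, 80, 1000),
        "region": rng.choice(["urban", "rural"], 1000),
        "exposure": rng.uniform(0.5, 1.0, 1000),
    })
    model = smf.glm(
        "claims ~ driver_age + region",
        data=data,
        family=sm.families.Poisson(),
        offset=np.log(data["exposure"]),
    ).fit()
    print(model.summary())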
- Clustering Algorithms
Traditional clustering algorithms still rely on rigid assumptions about the data which must be provided by the user. These assumptions appear both explicitly, as fixed numerical parameters, and implicitly, in the way in which the user represents the input to the algorithm. Data mining is currently a popular application of clustering technology, and this project attempts to eliminate fixed input parameters to clustering algorithms, allowing the computer to search for good choices and to select the appropriate dimensions of the data to use in similarity calculations. The project also develops a new approach that uses metric spaces to improve k-clustering algorithms.
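One standard way of letting the computer search for a good number of clusters, rather than fixing it as a user-supplied parameter, is shown in the sketch below; it is an illustration of the general idea using scikit-learn and invented data, not the project's own method.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def choose_k(data, k_range=range(2, 10)):
        """Search over candidate cluster counts and keep the k whose
        clustering scores best under the silhouette criterion."""
        best_k, best_score = None, -1.0
        for k in k_range:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
            score = silhouette_score(data, labels)
            if score > best_score:
                best_k, best_score = k, score
        return best_k, best_score

    rng = np.random.default_rng(1)
    # Three invented Gaussian blobs; the search should recover k = 3.
    data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
    print(choose_k(data))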
Toolkits
- Development of "DATALAMP" - TCS Programme with Lanner Group
Software developed at UEA in conjunction with the Lanner Group provides a suite of tools to analyse, visualise and extract knowledge from large databases. A research prototype of the software has been in use at UEA for a number of years and has been used on projects with Norwich Union, Master Foods and Nortel. The first commercial version of the toolkit, DATALAMP, was launched at the start of 2000. Version 2 of the software is planned for development in the second half of 2000 and will build on the existing design by introducing COM (Component Object Model) to implement the simulated annealing search engine. Other technologies, such as the ability to link to database systems, will be included, along with a host of new knowledge discovery tasks covering each stage of the KDD roadmap.
Bioinformatics
- Data Mining for Emerging Biological Threats - Institute of Food Research and John Innes Centre
The project uses yeast data as a model for biological data mining to investigate how current techniques can be applied to this area. The physiological and sequence data for a number of species and strains will also be examined to establish new methods to identify and classify yeasts. The project may also yield unparalleled opportunities for research into generating predictive capabilities for emerging microbial threats and aims to extend the application of the developed methods to new areas within the biological domain.
Contacts: Ian Roberts (IFR), Jo Dicks (JIC)
Data sources: National Collection of Yeast Cultures, Centraal Bureau voor Schimmelcultures and EMBL Nucleotide Sequence Database
UEA Web page: Data Mining: Bioinformatics
Application Areas
- Meteorological Data Mining with Master Foods and Centrica
The group has been working in collaboration with Master Foods, using data mining for mid- to long-term forecasting. A research report on the ability to predict discretised monthly NAO indices has also been delivered. The work has centred on using historical temperature and pressure data to find patterns describing European weather for certain months of the year. Datasets and expert meteorological knowledge for the initial project were provided by the Climatic Research Unit and the School of Environmental Sciences, both at the University of East Anglia. More recently, the U.S. COADS database has been used to obtain temperature, pressure and station data on a regular basis. Techniques for dealing with large amounts of missing, and sometimes unreliable, data have been developed during this project and integrated into our approach to data mining. Our work for British Gas (now part of Centrica) involved the search for rules that, given global meteorological data, would predict future values of a weather measure known to be closely linked to gas demand.
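As an illustration of how simulated annealing can drive such a rule search (a deliberately minimal sketch with hypothetical names; the group's search engine is considerably more sophisticated), each state below is a set of attribute=value conditions and each move toggles a single condition in or out of the rule:

    import math
    import random

    def sa_rule_search(records, target, conditions, steps=5000, temp=1.0, cool=0.999):
        """Simulated annealing over rule antecedents: each state is a subset
        of candidate attribute=value conditions, scored by rule accuracy."""
        def score(rule):
            covered = [r for r in records if all(r.get(a) == v for a, v in rule)]
            if not covered:
                return 0.0
            return sum(r[target[0]] == target[1] for r in covered) / len(covered)

        current = frozenset(random.sample(conditions, 1))
        best, best_score = current, score(current)
        for _ in range(steps):
            # Neighbour: toggle one candidate condition in or out of the rule.
            c = random.choice(conditions)
            neighbour = current ^ {c}
            if not neighbour:
                continue
            delta = score(neighbour) - score(current)
            # Accept improvements always; accept worse moves with a
            # probability that falls as the temperature cools.
            if delta > 0 or random.random() < math.exp(delta / temp):
                current = neighbour
                if score(current) > best_score:
                    best, best_score = current, score(current)
            temp *= cool
        return best, best_score

    records = [
        {"pressure": "low", "month": "Jan", "demand": "high"},
        {"pressure": "low", "month": "Feb", "demand": "high"},
        {"pressure": "high", "month": "Jun", "demand": "low"},
        {"pressure": "high", "month": "Jan", "demand": "low"},
    ]
    conditions = sorted({(a, v) for r in records for a, v in r.items() if a != "demand"})
    print(sa_rule_search(records, ("demand", "high"), conditions))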
- Telecoms Fraud Detection - TCS Programme with Nortel Networks
Nortel Networks, one of the top four worldwide telecommunications companies, supplied the data for a number of projects. One such project used customer profile information in which each record contained a variety of customer attributes, together with a field indicating whether the customer was suspected of using the service fraudulently. These records represented 'snapshots' of the customer, taken every two hours, and primarily contained information describing the difference between current and historical customer behaviour. Techniques developed at UEA were used to identify customer behaviour characteristics that could be used to predict whether a customer was using the telecommunications service fraudulently. The project resulted in a paper presented at the World Multiconference on Systemics, Cybernetics and Informatics.
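The sketch below illustrates the general shape of this approach with invented data: difference features between current and historical behaviour are derived from each snapshot and fed to a simple classifier. All names and figures are illustrative assumptions, not the actual Nortel data or the techniques used.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented snapshot data: each row compares a customer's current two-hour
    # window with their historical average, plus a fraud-suspicion label.
    snapshots = pd.DataFrame({
        "calls_now": [4, 80, 6, 95, 5, 70, 3, 88],
        "calls_hist_avg": [5, 6, 5, 7, 4, 8, 4, 6],
        "intl_mins_now": [0, 120, 2, 200, 1, 90, 0, 150],
        "intl_mins_hist_avg": [1, 3, 2, 4, 1, 2, 0, 3],
        "suspected_fraud": [0, 1, 0, 1, 0, 1, 0, 1],
    })
    # Difference features capture the change from historical behaviour,
    # mirroring the 'snapshot' encoding described above.
    snapshots["calls_delta"] = snapshots["calls_now"] - snapshots["calls_hist_avg"]
    snapshots["intl_delta"] = snapshots["intl_mins_now"] - snapshots["intl_mins_hist_avg"]

    X = snapshots[["calls_delta", "intl_delta"]]
    y = snapshots["suspected_fraud"]
    clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(clf, feature_names=list(X.columns)))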
- Web and Sequence Mining - TCS Programme with Nortel Networks
Another project with Nortel Networks examined data from log files detailing alarms and informational messages generated by nodes in a network. As these nodes are generally interdependent, a failure at one node is likely to have a knock-on effect elsewhere in the network. If this is a common chain of events, a pattern can be recognised and expressed as a rule representing the behaviour when that node fails. Such rules could then be used predictively, to warn that a fault is about to occur. Technologies including SQL and XML have been used extensively in this project to advance the group's current mining algorithms. The second part of the project is to identify users in a web server log and analyse patterns in their behaviour around a particular site. The patterns would determine what paths they take through the site, what areas they usually look at and, more importantly, when they leave and what prompted them to do so. Users could also be clustered into groups who regularly read similar pages, making it possible to suggest other pages they may be interested in. The work is currently funded through a TCS programme.
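A crude stand-in for the knock-on alarm patterns described above (illustrative only, with an invented log format) is to count ordered alarm pairs that occur within a fixed time window and report those seen repeatedly:

    from collections import Counter

    def frequent_follow_ons(events, window=60, min_count=2):
        """Count ordered alarm pairs (a -> b) occurring within `window`
        seconds of each other; events are (timestamp, node, alarm) tuples."""
        events = sorted(events)  # chronological order
        pairs = Counter()
        for i, (t1, n1, a1) in enumerate(events):
            for t2, n2, a2 in events[i + 1:]:
                if t2 - t1 > window:
                    break  # later events fall outside the window
                if (n1, a1) != (n2, a2):
                    pairs[((n1, a1), (n2, a2))] += 1
        return [(p, c) for p, c in pairs.items() if c >= min_count]

    log = [
        (0, "nodeA", "LINK_DOWN"), (12, "nodeB", "ROUTE_LOST"),
        (300, "nodeA", "LINK_DOWN"), (320, "nodeB", "ROUTE_LOST"),
        (600, "nodeC", "CPU_HIGH"),
    ]
    for pair, count in frequent_follow_ons(log):
        print(pair, "seen", count, "times")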
- Financial Sector - TCS Programme with Norwich Union (now CGNU)
A number of KDD projects have taken place with Norwich Union (now CGNU), the third largest insurance group in the UK; insurance is the second biggest UK invisible earner. An initial project, a TCS programme (rated alpha 5 by TCD/EPSRC), investigated the application of modern heuristic techniques to KDD problems in insurance, in particular the pricing of policies and marketing strategies. The work resulted in a unique collection of data mining tools using techniques such as genetic algorithms and simulated annealing. Both the company and the research group have subsequently used the tools in later projects. The approaches developed in this project have led to two journal papers, eight conference papers and seminars, including the Second International Conference on Knowledge Discovery in Databases, the leading international conference in this area.
- Medical
The UEA / Lanner KDD software, DATALAMP, has been used to analyse a database of 20,000 diabetes patient records supplied by St. Thomas’ Hospital, London (STH). A number of rules predicting early mortality were discovered. STH confirmed that the rules were valid and novel, and that they concurred with the hospital's latest research. Other medical databases are now being analysed.