DM: What we are doing in data mining

From: lindblad@physto.se (by way of Dorothy Firsching)
Date: Sun, 27 Jul 1997 20:03:29 -0400 (EDT)

Taming the Information Torrent
******************************

Thomas Lindblad and Clark Lindsey
Royal Institute of Technology, Stockholm (lindblad@particle.kth.se)

Where the Torrent is

With the tremendous advances in data acquisition and storage technologies, turning large volumes of raw measured data into useful information has become a significant problem. Data volumes in both commercial and scientific areas have reached sizes that defy even partial examination by humans and are literally swamping users. This data-firehose phenomenon appears in many scientific fields, including particle physics, medical imaging and remote sensing. Human genome database projects have collected gigabytes of data, and the NASA Earth Observation System satellites will yield 20 MB/s of sensed image data. Future high-energy physics experiments, such as those planned at the Large Hadron Collider under construction near Geneva, will search for new particles in an enormous data flow. In spite of hierarchical on-line filtering, the data volume sent to mass storage units is estimated to be close to 100 MB/s.

Mining your own data

Data mining has become the catch phrase for various methods of locating previously untapped value in large databases. The approach relies on data-driven extraction of information. Knowledge discovery in large data volumes and data mining are areas of common interest to scientists in many fields. Knowledge discovery is the automated search for patterns and relationships hidden in large data volumes. The information sought is generally unknown in detail but important; the results are crucial, yet they often appear in fragments that need interpretation. The search itself can be carried out in several ways, depending on how the database is formatted.
It may involve pre-selection and transformation of the data using highly specialised data structures. Algorithms based on attribute-oriented induction and general domain knowledge have been shown to be efficient. Massively parallel processing of the data is generally required to achieve a reasonable search time; it is also necessary when the data arrives as a fast stream on a computer bus, a telecommunication system or a data acquisition system. Clearly the approach will depend on the particular case, e.g. how well the pattern or data is known, and how often the database will be used for the same or for different searches. Often it is enough to extract fragments of patterns whose assimilation yields an identification; otherwise, subsequent investigations apply model calculations to verify the data subset.

Mining and brainmaking

One very promising approach is to combine data mining with neural networks, which are algorithms that learn from examples, to produce "intelligent instruments". Such systems would turn real-time data directly into useful information. A recent example of this approach is the diagnosis of the Space Shuttle main engine during launch: a neural network recognizes signatures of contaminants in spectra from the rocket engine plume. Although the model used to train the net provides strong prior knowledge, it must be incorporated with care to capture the instrumentation characteristics as well as the critical values needed to understand the measurements. Here neural networks should be helpful in interpreting the mined data. Furthermore, analysis often yields numerical results (e.g. the wavelength in nm of spectral lines), but it may not always be easy to assimilate the information efficiently and discover the rules expressing the underlying relationships. Because a well-trained neural network inherently generalizes, such an analysis may well be extended to include predictions.
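To make the learning-from-examples idea concrete, here is a minimal sketch, not the Shuttle diagnostic system itself: a single-layer perceptron is trained on invented three-bin "spectra" (all vectors and labels below are hypothetical) and then classifies a spectrum it has never seen, illustrating the generalization mentioned above.

```python
# Hedged sketch: a perceptron learning from labelled toy "spectra".
# The data are invented stand-ins, not real plume measurements.
import numpy as np

# Hypothetical training examples: rows are 3-bin intensity vectors,
# label 1 = contaminant signature present, 0 = clean
X = np.array([[0.1, 0.9, 0.1],
              [0.2, 0.8, 0.0],
              [0.9, 0.1, 0.1],
              [0.8, 0.0, 0.2]])
y = np.array([1, 1, 0, 0])

w = np.zeros(3)          # weights and bias start at zero
b = 0.0
for _ in range(20):      # classic perceptron learning rule
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += (yi - pred) * xi
        b += (yi - pred)

# Generalization: classify a spectrum that was not in the training set
unseen = np.array([0.15, 0.85, 0.05])
print(1 if unseen @ w + b > 0 else 0)  # prints 1 (contaminant-like)
```

A real instrument would of course use a larger network and far longer spectra, but the principle, learning the decision rule from examples rather than programming it explicitly, is the same.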
Relative relationships

Finding relationships among entries in a database is another important data-mining task. A typical example is categorizing related texts in a large database of reports, and here neural networks have also shown great promise. A group at the University of Helsinki has used a neural network algorithm called the Self-Organizing Map (SOM) to categorize entries in a database of internet newsgroup submissions. The number of occurrences of each individual word becomes one component of a very long vector, and from these vectors the SOM learns which vectors, i.e. texts, are similar. The SOM, in effect, learns that texts about similar topics tend to use similar words.

Numerous commercial, governmental and scientific endeavours produce very large image archives that can contain millions of entries. Remote sensing by satellites, for example, produces thousands of images every day, and these raw images are then processed into even greater numbers of enhanced images. Similarly, photographic archives built over decades from journalistic and other sources provide relevant pictures for books and articles. Finding a particular image, or one related to a given subject (e.g. a satellite image of a particular type of offshore pollution, or a photograph of a seashore factory dumping waste), demands searching tools that are fast yet flexible enough to allow searches for arbitrary scenes. Typically, the database index holds compact descriptions, text or numbers, of the images. A balance must be struck, however, between the need for a small index that can be searched quickly and the loss of information if the descriptions are too brief. Several image database systems are available, both commercially and from research groups. The simplest indexing techniques use text descriptions of the images, with searches made by keywords, e.g. 'factory' and 'seashore'.
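The word-count idea behind the text-categorizing approach above can be sketched in a few lines. The documents and vocabulary here are invented, and the "map" is reduced to just two competitive nodes; a real SOM would use a full grid of nodes and much longer vectors.

```python
# Hedged sketch: word-count vectors plus two competitive nodes,
# a drastically simplified stand-in for a Self-Organizing Map.
import numpy as np

docs = ["rocket engine plume spectra",
        "engine plume contaminant spectra",
        "satellite image of seashore",
        "seashore factory satellite image"]

vocab = sorted({w for d in docs for w in d.split()})
# each text becomes a vector of word counts over the vocabulary
vectors = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# two nodes, seeded on two of the texts; competitive learning pulls
# each node toward the texts it wins, so texts using similar words
# end up assigned to the same node
nodes = vectors[[0, 2]].copy()
for _ in range(50):
    for v in vectors:
        winner = np.argmin(((nodes - v) ** 2).sum(axis=1))
        nodes[winner] += 0.2 * (v - nodes[winner])

clusters = [int(np.argmin(((nodes - v) ** 2).sum(axis=1))) for v in vectors]
print(clusters)  # [0, 0, 1, 1]: rocket texts on one node, satellite texts on the other
```

The essential point survives the simplification: similarity between texts is never defined by hand, it emerges from the overlap of their word-count vectors.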
More advanced systems, however, allow one to present an example image, or even a hand drawing, and then find similar images in the database. Such indexing requires processing the images to bring out features that distinguish them. These features range from straightforward ones, such as the colour composition (e.g. the percentages of red, green and blue), to much more complicated ones, such as mathematical values representing textures and structures. The latter have no rigid definitions; various transformation functions are used and new ones are being researched. In a manner similar to the text categorizing, we are investigating whether very simple features can distinguish and identify images. Instead of particular words acting as the components of a feature vector, we would look at small details in the image, such as the number of times simple sub-matrices of light and dark patterns occur. We will also look at other simple features for indexing, such as wavelet coefficients.

Realisation in hardware

Data mining research involves implementing the algorithms and data in very fast hardware to manage the torrent. Portions of a given database would be buffered onto a very large RAM disk, i.e. a device that appears to a program like a hard drive but is actually composed of random access memories, offering orders-of-magnitude faster response. In addition, algorithms such as the neural networks mentioned above would be implemented in hardware chips such as FPGAs, which allow reconfiguration for new algorithms and offer very fast performance compared to running in a central processor. The RAM disk and algorithm hardware would be packaged in a single box that could be added to a system as easily as an extra hard drive, with a user interface program providing the choice of algorithm. Portions of the data from the long-term storage devices would be loaded onto the RAM disk, where the algorithms can run at maximum speed. Meanwhile, new data would be moving into a RAM buffer, ready when processing of the previous data is finished.
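In software terms, this overlap of loading and processing is a double-buffering scheme, and it can be sketched as follows. The chunk source and the "mining" step below are invented stand-ins for the long-term storage and the hardware algorithms.

```python
# Hedged sketch: double buffering with a bounded queue, so the next
# chunk is staged while the previous one is still being processed.
import queue
import threading

chunks = [list(range(i, i + 4)) for i in range(0, 12, 4)]  # fake data chunks
buffers = queue.Queue(maxsize=1)   # at most one chunk staged ahead

def loader():
    for chunk in chunks:           # blocks while the staged slot is full,
        buffers.put(chunk)         # i.e. loading overlaps with processing
    buffers.put(None)              # sentinel: no more data

threading.Thread(target=loader, daemon=True).start()

results = []
while (chunk := buffers.get()) is not None:
    results.append(sum(chunk))     # stand-in for the mining algorithm

print(results)  # [6, 22, 38]
```

In the envisioned hardware the two buffers would be physical RAM banks rather than a Python queue, but the control flow, process one bank while the other fills, is the same.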