DM: What we are doing in data mining

From: lindblad@physto.se (by way of Dorothy Firsching)
Date: Sun, 27 Jul 1997 20:03:29 -0400 (EDT)

Taming the Information Torrent
******************************

Thomas Lindblad and Clark Lindsey
Royal Institute of Technology, Stockholm (lindblad@particle.kth.se)

Where the Torrent is

With the tremendous advances in data acquisition and storage technologies, turning large volumes of raw measured data into useful information has become a significant problem. Data volumes in both commercial and scientific areas have reached sizes that defy even partial examination by humans and are literally swamping users. This data-firehose phenomenon appears in many scientific fields, including particle physics, medical imaging and remote sensing. Human genome database projects have collected gigabytes of data, and the NASA Earth Observation System satellites will yield 20 MB/s of sensed image data. Future high-energy physics experiments, such as those planned at the Large Hadron Collider under construction near Geneva, will search for new particles in an enormous data flow. In spite of hierarchical on-line filtering, the data volume sent to mass storage units is estimated to be close to 100 MB/s.

Mining your own data

Data mining has become the catch phrase for various methods of locating previously untapped value in large databases. The approach relies on data-driven extraction of information. Knowledge discovery in large data volumes and data mining are areas of common interest to scientists in many fields. Knowledge discovery is the automated search for patterns and relationships hidden in large data volumes. The information sought is generally unknown in detail but important; the results are crucial, yet they often appear in fragments that need interpretation. The search itself can be carried out in several ways, depending on how the database is formatted.
It may involve pre-selection and transformation of the data using highly specialised data structures. Algorithms based on attribute-oriented induction and general domain knowledge have been shown to be efficient. Massively parallel processing of the data is generally required to achieve a reasonable search time; it is also necessary when the data arrives as a fast stream on a computer bus, a telecommunication system or a data acquisition system. Clearly the approach will depend on the particular case, e.g. how well the pattern or data is known, and how often the database will be used for the same or for different searches. Often it is enough to extract fragments of patterns whose assimilation yields an identification; otherwise, subsequent investigations apply model calculations to verify the data subset.

Mining and brainmaking

One very promising approach is to combine data mining with neural networks, which are algorithms that learn from examples, to produce "intelligent instruments". Such systems would turn real-time data directly into useful information. A recent example of this approach is the diagnosis of the Space Shuttle main engine during launch: a neural network recognizes signatures of contaminants in spectra from the rocket engine plume. Although the model used to train the net provides strong prior knowledge, it must be incorporated with care to capture the instrumentation characteristics as well as the critical values needed to understand the measurements. Here neural networks should be helpful in interpreting the mined data. Furthermore, analysis often yields numerical results (e.g. the wavelength in nm of spectral lines), but it may not always be easy to assimilate the information efficiently and discover the rules expressing the underlying relationships. Because a well-trained neural network inherently generalizes, such an analysis may well be extended to include predictions.
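To make the learning-from-examples idea concrete, here is a minimal sketch, not the Shuttle diagnostic system itself: a single-layer perceptron is trained on invented three-bin "spectra" (all vectors and labels below are hypothetical) and then classifies a spectrum it has never seen, illustrating the generalization mentioned above.

```python
# Hedged sketch: a perceptron learning from labelled toy "spectra".
# The data are invented stand-ins, not real plume measurements.
import numpy as np

# Hypothetical training examples: rows are 3-bin intensity vectors,
# label 1 = contaminant signature present, 0 = clean
X = np.array([[0.1, 0.9, 0.1],
              [0.2, 0.8, 0.0],
              [0.9, 0.1, 0.1],
              [0.8, 0.0, 0.2]])
y = np.array([1, 1, 0, 0])

w = np.zeros(3)          # weights and bias start at zero
b = 0.0
for _ in range(20):      # classic perceptron learning rule
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += (yi - pred) * xi
        b += (yi - pred)

# Generalization: classify a spectrum that was not in the training set
unseen = np.array([0.15, 0.85, 0.05])
print(1 if unseen @ w + b > 0 else 0)  # prints 1 (contaminant-like)
```

A real instrument would of course use a larger network and far longer spectra, but the principle, learning the decision rule from examples rather than programming it explicitly, is the same.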
Relative relationships

Finding relationships among entries in a database is another important data-mining task. A typical example is categorizing related texts in a large database of reports, and here neural networks have also shown great promise. A group at the University of Helsinki has used a neural network algorithm called the Self-Organizing Map (SOM) to categorize entries in a database of internet newsgroup submissions. The number of occurrences of each individual word becomes one component of a very long vector, and from these vectors the SOM learns which vectors, i.e. texts, are similar. The SOM, in effect, learns that texts about similar topics tend to use similar words.

Numerous commercial, governmental and scientific endeavours produce very large image archives that can contain millions of entries. Remote sensing by satellites, for example, produces thousands of images every day, and these raw images are then processed into even greater numbers of enhanced images. Similarly, photographic archives built over decades from journalistic and other sources provide relevant pictures for books and articles. Finding a particular image, or one related to a given subject (e.g. a satellite image of a particular type of offshore pollution, or a photograph of a seashore factory dumping waste), demands searching tools that are fast yet flexible enough to allow searches for arbitrary scenes. Typically, the database index holds compact descriptions, text or numbers, of the images. A balance must be struck, however, between the need for a small index that can be searched quickly and the loss of information if the descriptions are too brief. Several image database systems are available, both commercially and from research groups. The simplest indexing techniques use text descriptions of the images, with searches made by keywords, e.g. 'factory' and 'seashore'.
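The word-count idea behind the text-categorizing approach above can be sketched in a few lines. The documents and vocabulary here are invented, and the "map" is reduced to just two competitive nodes; a real SOM would use a full grid of nodes and much longer vectors.

```python
# Hedged sketch: word-count vectors plus two competitive nodes,
# a drastically simplified stand-in for a Self-Organizing Map.
import numpy as np

docs = ["rocket engine plume spectra",
        "engine plume contaminant spectra",
        "satellite image of seashore",
        "seashore factory satellite image"]

vocab = sorted({w for d in docs for w in d.split()})
# each text becomes a vector of word counts over the vocabulary
vectors = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# two nodes, seeded on two of the texts; competitive learning pulls
# each node toward the texts it wins, so texts using similar words
# end up assigned to the same node
nodes = vectors[[0, 2]].copy()
for _ in range(50):
    for v in vectors:
        winner = np.argmin(((nodes - v) ** 2).sum(axis=1))
        nodes[winner] += 0.2 * (v - nodes[winner])

clusters = [int(np.argmin(((nodes - v) ** 2).sum(axis=1))) for v in vectors]
print(clusters)  # [0, 0, 1, 1]: rocket texts on one node, satellite texts on the other
```

The essential point survives the simplification: similarity between texts is never defined by hand, it emerges from the overlap of their word-count vectors.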
More advanced systems, however, allow one to present an example image, or even a hand drawing, and then find similar images in the database. Such indexing requires processing the images to bring out features that distinguish them. These features range from straightforward ones, such as the colour composition (e.g. the percentages of red, green and blue), to much more complicated ones, such as mathematical values representing textures and structures. The latter have no rigid definitions; various transformation functions are used and new ones are being researched. In a manner similar to the text categorizing, we are investigating whether very simple features can distinguish and identify images. Instead of particular words acting as the components of a feature vector, we would look at small details in the image, such as the number of times simple sub-matrices of light and dark patterns occur. We will also look at other simple features for indexing, such as wavelet coefficients.

Realisation in hardware

Data mining research involves implementing the algorithms and data in very fast hardware to manage the torrent. Portions of a given database would be buffered onto a very large RAM disk, i.e. a device that appears to a program like a hard drive but is actually composed of random access memories, offering orders-of-magnitude faster response. In addition, algorithms such as the neural networks mentioned above would be implemented in hardware chips such as FPGAs, which allow reconfiguration for new algorithms and offer very fast performance compared to running in a central processor. The RAM disk and algorithm hardware would be packaged in a single box that could be added to a system as easily as an extra hard drive, with a user interface program providing the choice of algorithm. Portions of the data from the long-term storage devices would be loaded onto the RAM disk, where the algorithms can run at maximum speed. Meanwhile, new data would be moving into a RAM buffer, ready when processing of the previous data is finished.
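In software terms, this overlap of loading and processing is a double-buffering scheme, and it can be sketched as follows. The chunk source and the "mining" step below are invented stand-ins for the long-term storage and the hardware algorithms.

```python
# Hedged sketch: double buffering with a bounded queue, so the next
# chunk is staged while the previous one is still being processed.
import queue
import threading

chunks = [list(range(i, i + 4)) for i in range(0, 12, 4)]  # fake data chunks
buffers = queue.Queue(maxsize=1)   # at most one chunk staged ahead

def loader():
    for chunk in chunks:           # blocks while the staged slot is full,
        buffers.put(chunk)         # i.e. loading overlaps with processing
    buffers.put(None)              # sentinel: no more data

threading.Thread(target=loader, daemon=True).start()

results = []
while (chunk := buffers.get()) is not None:
    results.append(sum(chunk))     # stand-in for the mining algorithm

print(results)  # [6, 22, 38]
```

In the envisioned hardware the two buffers would be physical RAM banks rather than a Python queue, but the control flow, process one bank while the other fills, is the same.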