
DM: Small data sets


From: Frank Lemke
Date: Fri, 28 Apr 2000 03:59:24 +0200
 > Hi!
 >
 > I've been doing some research on Data Mining and have come into the
 > twilight zone: why is everybody talking only about "large" databases?
 > What about "small" databases - don't they have anything valuable inside?
 > Don't they hide nuggets, useful patterns?
 >
 > And, nobody (best to my knowledge) has come up with a definition of
 > "small" and "large" - not in terms of bits and bytes, but something more
 > persistent to the change.

Good question. In my experience, data mining is increasingly associated
with database marketing, customer behavior modeling, and web mining.
There, one naturally has to deal with large numbers of records. But
"large" can also be defined this way:

"Commonly, a large data set is one that has many cases or records. With
this book, however, 'large' rather refers to the number of variables
describing each record. When there are more variables than cases, the most
known algorithms are running into some problems (in mathematical
statistics, for instance, covariance matrix becomes singular so that
inversion is impossible; Neural Networks fail to learn). Even if the data
are well-behaved, a large number of variables means that the data are
distributed in a high dimensional hypercube, causing the known
dimensionality problem." (Mueller/Lemke, Self-Organising Data Mining,
ISBN 3-89811-861-4)
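
The singular-covariance point in the quoted passage can be checked on a toy
example. The following is only a minimal sketch in plain Python: the data
(3 cases, 5 variables) are made up, and the helper names covariance and rank
are hypothetical, not taken from the book or from KnowledgeMiner. It uses
exact fractions so that the rank is not blurred by rounding.

```python
from fractions import Fraction


def covariance(rows):
    """Sample covariance matrix (p x p) of n cases with p variables each."""
    n, p = len(rows), len(rows[0])
    means = [Fraction(sum(r[j] for r in rows), n) for j in range(p)]
    centered = [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def rank(matrix):
    """Matrix rank by exact Gaussian elimination on Fractions."""
    m = [row[:] for row in matrix]
    r = 0  # number of pivot rows found so far
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Made-up data: 3 cases (records), 5 variables each.
cases = [
    [1, 4, 2, 7, 3],
    [2, 1, 5, 3, 8],
    [6, 2, 1, 4, 5],
]

cov = covariance(cases)
cov_rank = rank(cov)

# The 5 x 5 covariance matrix has rank at most (number of cases - 1),
# so it is rank-deficient and cannot be inverted.
print("matrix size:", len(cov), "rank:", cov_rank)
```

With n cases, the centered data have at most n - 1 independent rows, so the
covariance matrix of p > n - 1 variables is always rank-deficient, whatever
the actual values are.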

It is often more difficult to extract useful knowledge from 'small' data
sets. For many economic (e.g. world models), ecological (global warming,
water/air pollution), medical/biochemical (diagnosis of diseases,
carcinogenicity prediction of aromatic compounds), and other problems, only
rather short data sets are available. Extracting knowledge from short and
noisy data is the primary application area of self-organising data mining
technologies. The KnowledgeMiner software has been used successfully on all
of the problems mentioned. Other examples are included in the
downloadable demo (http://www.knowledgeminer.net).

Frank





Copyright © 1999 Nautilus Systems, Inc. All Rights Reserved.