DM: Small data sets

From: Frank Lemke
Date: Fri, 28 Apr 2000 03:59:24 +0200

> Hi!
>
> I've been doing some research on Data Mining and have come into the
> twilight zone: why is everybody talking only about "large" databases?
> What about "small" databases - don't they have anything valuable inside?
> Don't they hide nuggets, useful patterns?
>
> And, nobody (best to my knowledge) has come up with a definition of
> "small" and "large" - not in terms of bits and bytes, but something more
> persistent to the change.

Good question. In my experience, data mining is increasingly associated
with database marketing, customer behavior modeling, and web mining.
Here one naturally has to deal with large numbers of records. But
"large" can also be defined this way:

"Commonly, a large data set is one that has many cases or records. With
this book, however, 'large' rather refers to the number of variables
describing each record. When there are more variables than cases, the most
known algorithms are running into some problems (in mathematical
statistics, for instance, covariance matrix becomes singular so that
inversion is impossible; Neural Networks fail to learn). Even if the data
are well-behaved, a large number of variables means that the data are
distributed in a high dimensional hypercube, causing the known
dimensionality problem." (Mueller/Lemke, Self-Organising Data Mining,
ISBN 3-89811-861-4)
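
The point about the covariance matrix can be checked with a small
experiment. What follows is only a minimal sketch, assuming NumPy and
synthetic random data; the function name covariance_rank is made up for
this illustration and is not taken from the book or from KnowledgeMiner.
It compares the rank of the sample covariance matrix when there are more
variables than cases with the rank when there are more cases than
variables.

```python
import numpy as np


def covariance_rank(n_cases, n_variables, seed=0):
    """Return the rank of the sample covariance matrix of synthetic data."""
    rng = np.random.default_rng(seed)
    # Rows are cases (records), columns are variables.
    data = rng.normal(size=(n_cases, n_variables))
    cov = np.cov(data, rowvar=False)  # shape: (n_variables, n_variables)
    return int(np.linalg.matrix_rank(cov))


# More variables than cases: the rank is at most n_cases - 1, so the
# 20 x 20 covariance matrix is singular and has no ordinary inverse.
rank_wide = covariance_rank(n_cases=5, n_variables=20)

# More cases than variables: the 5 x 5 covariance matrix can reach full rank.
rank_tall = covariance_rank(n_cases=50, n_variables=5)

print(rank_wide, rank_tall)
```

With 5 cases and 20 variables the rank cannot exceed 4, because centering
the data removes one degree of freedom. This is why methods that need the
inverse of the covariance matrix break down on such data.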

It is often more difficult to extract useful knowledge from 'small' data
sets. For many problems in economics (e.g. world models), ecology (global
warming, water/air pollution), medicine and biochemistry (diagnosis of
diseases, carcinogenicity prediction of aromatic compounds), and other
fields, only rather short data sets are available. Extracting knowledge
from short and noisy data is the primary application area of
self-organising data mining technologies. The KnowledgeMiner software has
been used successfully on all of the problems mentioned. Other examples
are included in the downloadable demo (http://www.knowledgeminer.net).

Frank
