Re: DM: Small data sets

From: Werner E. HELM
Date: Tue, 02 May 2000 11:07:37 +0200


Hi Frank,

[Question cut off.]
Your answer was:

 >Good question. In my experience, data mining is increasingly associated
 >with database marketing, customer behavior modeling, and web mining.
 >Here, one naturally has to deal with large numbers of records. But
 >"large" can also be defined this way:
 >
 >"Commonly, a large data set is one that has many cases or records. With
 >this book, however, 'large' rather refers to the number of variables
 >describing each record. When there are more variables than cases, the most
 >known algorithms are running into some problems (in mathematical
 >statistics, for instance, covariance matrix becomes singular so that
 >inversion is impossible; Neural Networks fail to learn). Even if the data
 >are well-behaved, a large number of variables means that the data are
 >distributed in a high dimensional hypercube, causing the known
 >dimensionality problem." (Mueller/Lemke, Self-Organising Data Mining,
 >ISBN 3-89811-861-4)
 >
 >Often it is more difficult to extract useful knowledge from 'small' data
 >sets. For many economic (e.g. a world model), ecological (global warming,
 >water/air pollution), medical/biochemical (diagnosis of diseases,
 >carcinogenicity prediction of aromatic compounds), and other problems,
 >only rather short data sets are available. Extracting knowledge from
 >short and noisy data is the primary application area of self-organising
 >data mining technologies. The KnowledgeMiner software has been used
 >successfully on all the problems mentioned. Other examples are included
 >in the downloadable demo (http://www.knowledgeminer.net).
 >
 >Frank

My brief remark is:

Concentrating on large datasets is, I believe, a matter of demarcation,
which is essential for a new field to arise and to establish itself
successfully. If you or someone else claims that DM covers any dataset,
then it must be a superset of STATISTICS, which has some hundred years of
tradition and many departments worldwide, has developed, under the label
Exploratory Data Analysis, many methods and techniques for small datasets
(some extensible to large datasets), and has also developed many methods
for the many-variables situation (see the contributions of C.F. Gauss,
..., L. Breiman, J. Friedman, ...).
Which DM tool, DM specialist, or DM group could really claim to master
all of statistics?

Is there a DM department at any European or American university?

How would you position yourself with respect to statistics as a field, or
to a statistics department as an organization?

I could imagine that quite a few top DMers would like to succeed J.
Friedman at Stanford, were it not for so much less money ....

Werner

(I'm a Professor of Statistics and Operations Research; I let myself be
called mathematician, statistician, operations researcher, simulation guy,
problem solver, data miner, and expect some more labels in the future,
should I live long enough. I'm very liberal w.r.t. labels.)

What I mean is that I did DM with large and small datasets -
non-automatically, of course - when the label DM had not yet been invented
or still meant Deutsche Mark. From this point of view, much of the noise
about DM nowadays appears to be mere marketing hype to sell products. This
hype could vanish as soon as new targets seem profitable; people really
doing DM would continue to do so, irrespective of how their work is
labeled. (MIS, EIS, etc. were parallel cases: it is out of fashion to talk
about MIS now that some EIS deliver parts of what, 10 years ago, every
MIS was promised to deliver but never really did.)
