Re: DM: Small data sets

From: Werner E. HELM
Date: Tue, 02 May 2000 11:07:37 +0200

>Hi Frank : Question cut off. Your answer was :
>Good question. From my experience, data mining is increasingly being
>assigned to database marketing, customer behavior modeling, web mining.
>Here, one has - naturally - to deal with large numbers of records. But
>"large" can also be defined this way:
>
>"Commonly, a large data set is one that has many cases or records. With
>this book, however, 'large' rather refers to the number of variables
>describing each record. When there are more variables than cases, the most
>known algorithms are running into some problems (in mathematical
>statistics, for instance, covariance matrix becomes singular so that
>inversion is impossible; Neural Networks fail to learn). Even if the data
>are well-behaved, a large number of variables means that the data are
>distributed in a high dimensional
>hypercube, causing the known dimensionality problem." (Mueller/Lemke,
>Self-Organising Data Mining, ISBN 3-89811-861-4)
>
>Often it is more difficult to extract useful knowledge from 'small' data
>sets. For many economic (world model e.g.), ecological (global warming,
>water/ air pollution), medical/ bio-chemical (diagnosis of diseases,
>carcinogenicity prediction of aromatic compounds) and other problems, only
>rather short data sets are available. Extracting knowledge from short and
>noisy data is the primary application area of self-organising data mining
>technologies. On all the mentioned problems the KnowledgeMiner software
>has been used successfully. Other examples are included in the
>downloadable demo (http://www.knowledgeminer.net).
>
>Frank

My brief remark is : Concentrating on large datasets is, I believe, a matter of delimitation, which is essential for a new field to arise and to establish itself successfully.
If you or anyone else claims that DM covers any dataset, then it must be a superset of STATISTICS, which has some hundred years of tradition and many departments worldwide, has developed, under the label Exploratory Data Analysis, many methods and techniques for small datasets (some extensible to large datasets), and has also developed many methods for the many-variables situation (see the contributions of C.F. Gauss, ....., L. Breiman, J. Friedman, ......).

Which DM-tool, which DM-specialist or DM-group could really claim to master all of statistics ??? Is there a DM-department at any European / American university ?? How would you position yourself w.r.t. statistics as a field or to a stat-dept. as an organization ???

I could imagine that quite a few top-DMers would like to succeed J. Friedman at Stanford, were it not for so much less money ....

Werner

(I'm Professor of Statistics and Operations Research ; I let myself be called mathematician, statistician, operations researcher, simulation guy, problem solver, data-miner, and expect some more labels in the future, should I live long enough. I'm very liberal w.r.t. labels.)

What I mean is that I did DM with large and small datasets - non-automatically of course - when the label DM was not yet invented, or still meant Deutsche Mark. From this point of view it appears that much of the noise about DM nowadays is mere marketing hype in order to sell products. This marketing hype could vanish as soon as new targets seem profitable; people really doing DM would continue to do so, irrespective of how their work is labeled. (MIS, EIS, etc. were parallel cases : talking about MIS is out of fashion now that some EIS deliver parts of what was promised 10 years ago that every MIS would deliver, but never really did.)
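
As a side note on the quoted Mueller/Lemke remark that the covariance matrix becomes singular when there are more variables than cases, here is a minimal sketch that checks this on a toy example. The data values and the helper names `sample_covariance` and `rank` are made up for illustration and come from neither the book nor the KnowledgeMiner software. With n cases the sample covariance matrix has rank at most n - 1, so with p > n variables it cannot be inverted.

```python
from fractions import Fraction


def sample_covariance(rows):
    """Return the p x p sample covariance matrix of n cases with p variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(Fraction(r[j]) for r in rows) / n for j in range(p)]
    centered = [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def rank(matrix):
    """Compute the rank by exact Gaussian elimination (no rounding issues)."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rk = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((r for r in range(rk, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, n_rows):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


# Hypothetical data: 3 cases (rows), 5 variables (columns).
data = [
    [1, 4, 2, 7, 3],
    [2, 1, 5, 3, 8],
    [6, 2, 1, 4, 5],
]

cov = sample_covariance(data)
cov_rank = rank(cov)

# A 5 x 5 matrix is invertible only if its rank is 5.
print("variables:", len(cov), "rank of covariance matrix:", cov_rank)
print("singular:", cov_rank < len(cov))
```

Exact fractions are used instead of floating point so that the rank is not blurred by rounding; for real data one would normally use a numerical library instead.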