RE: DM: RE: Data Forms for Mining
From: Khabaza, Tom
Date: Thu, 25 May 2000 13:49:02 +0100

Interesting response, Ken. I'm slightly surprised you had so much trouble with the wide dataset you describe in your earlier response. As I'm sure you know, Clementine has been coping with datasets of thousands of columns since the DTOX project in '96, which described molecules using about 7700 features.

As you rightly point out, the problems are not technical, as any reasonable tool should cope with wide datasets. The problems are methodological: wide datasets are unwieldy, and how does the human user of data mining results understand the models when they refer to thousands of variables?

Of course there is no single answer that will cover all cases. In DTOX we reduced the width of the data using Clementine's filter operations to combine the field selections from multiple C5.0 rulesets. (It sounds like this might have been appropriate technology for your widget example also.) In other situations, such as basket analysis or web mining, this might not be appropriate, and you may have to live with the wide datasets for some purposes. It's up to the tool designers to make this as painless as possible.

--
Tom Khabaza
Programme Manager, Data Mining
SPSS Services
+44 1483 719304
tomk@spss.com

> -----Original Message-----
> From: Collier, Ken [mailto:kencollier@kpmg.com]
> Sent: Wednesday, May 24, 2000 6:43 PM
> To: datamine-l@nautilus-sys.com
> Subject: RE: DM: RE: Data Forms for Mining
>
> Yes. That was 700 "vars" not "chars". Sorry Eric, but for reasons of
> diplomacy, I'm not going to name names of commercial DM tools that we've
> had trouble getting to handle more than 700 vars.
>
> We routinely have client data from a variety of industries that exceeds
> 1000 variables. Most data mining tools handle many millions of records
> without difficulty (other than longer compute times). However, it is the
> width of the data that commonly presents greater challenges. Now, the
> reality is that during the first few modeling iterations, we generally
> identify the most important variables, and these often number fewer
> than 100. It is this initial variable reduction that is nontrivial.
>
> It has been our experience that good models can almost always be built
> on a 20% sample of records, and these models generally include fewer
> than 50 key variables. However, it should never be the tools that force
> you to sample records or reduce your independent variables; it should be
> good methodology, with the support of tools, that helps you arrive at
> efficient/effective modeling.
>
> Ken Collier
> Senior Manager, Business Intelligence
> KPMG Consulting
> Corporate Sponsor of the Center for Data Insight
> http://insight.cse.nau.edu
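[Editor's note: the width-reduction approach Tom describes, combining the field selections from multiple rulesets and filtering the data down to that union, can be sketched in a few lines. The snippet below is a minimal illustration only, not Clementine or C5.0: it assumes a pandas DataFrame named `df` with a numeric target column `target`, and uses scikit-learn's DecisionTreeClassifier as a stand-in for the C5.0 rulesets, and a 20% record sample per model as in Ken's note.]

    # Illustrative sketch: train several tree models on sampled records,
    # take the union of the fields they actually split on, and filter
    # the wide dataset down to those fields. Names `df` and `target`
    # are assumed for the example.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def reduce_width(df: pd.DataFrame, target: str,
                     n_models: int = 5, sample_frac: float = 0.2) -> pd.DataFrame:
        x_cols = [c for c in df.columns if c != target]
        used = set()
        for seed in range(n_models):
            sample = df.sample(frac=sample_frac, random_state=seed)
            tree = DecisionTreeClassifier(max_depth=8, random_state=seed)
            tree.fit(sample[x_cols], sample[target])
            # Keep any field this model actually used in a split.
            used.update(col for col, imp in zip(x_cols, tree.feature_importances_)
                        if imp > 0)
        # Return the narrowed dataset: selected fields plus the target.
        return df[sorted(used) + [target]]

The point of the sketch is the filtering step at the end, not the choice of model: any ruleset or tree learner that reports which fields it used could play the same role.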