RE: DM: RE: Data Forms for Mining
From: Collier, Ken
Date: Fri, 26 May 2000 01:43:13 -0400

Thanks Tom. We've recently had some good successes with Clementine and I appreciate your suggestion. I'm wondering whether a data set like DTOX is as varied in data types, formats, codes, quality problems, etc. as a typical business data set containing 5-10 years of historical, often manually entered data. It may not matter, but even when our tools have successfully handled more than 1000 features, it hasn't been simple.

In your filtering suggestion, are you saying that you generate multiple C5.0 results from the entire database using different parameter settings, and then use the filtering node to isolate key features? That sounds like a variation on bundling; I'd like to know more. We are doing a lot with bundling, bagging, and boosting to improve our predictive accuracy. However, as far as I know, SAS EM is the only tool that has this capability built in. I'd love to be able to combine models in Clementine. Is this possible?

Ken Collier
Senior Manager, Business Intelligence
KPMG Consulting
Corporate Sponsor of the Center for Data Insight
http://insight.cse.nau.edu

-----Original Message-----
From: Khabaza, Tom [mailto:tkhabaza@spss.com]
Sent: Thursday, May 25, 2000 5:49 AM
To: 'datamine-l@nautilus-sys.com'
Subject: RE: DM: RE: Data Forms for Mining

Interesting response, Ken. I'm slightly surprised you had so much trouble with the wide dataset you describe in your earlier response. As I'm sure you know, Clementine has been coping with datasets of thousands of columns since the DTOX project in '96, which described molecules using about 7700 features. As you rightly point out, the problems are not technical, since any reasonable tool should cope with wide datasets. The problems are methodological: wide datasets are unwieldy, and how does the human user of data mining results understand models that refer to thousands of variables?
Of course there is no single answer that will cover all cases. In DTOX we reduced the width of the data using Clementine's filter operations to combine the field selections from multiple C5.0 rulesets. (It sounds like this might have been appropriate technology for your widget example also.) In other situations, such as basket analysis or web mining, this might not be appropriate, and you may have to live with the wide datasets for some purposes. It's up to the tool designers to make this as painless as possible.

--
Tom Khabaza
Programme Manager, Data Mining
SPSS Services
+44 1483 719304
tomk@spss.com

> -----Original Message-----
> From: Collier, Ken [mailto:kencollier@kpmg.com]
> Sent: Wednesday, May 24, 2000 6:43 PM
> To: datamine-l@nautilus-sys.com
> Subject: RE: DM: RE: Data Forms for Mining
>
> Yes, that was 700 "vars", not "chars". Sorry Eric, but for reasons of
> diplomacy, I'm not going to name names of commercial DM tools that we've
> had trouble getting to handle more than 700 vars.
>
> We routinely have client data from a variety of industries that exceeds
> 1000 variables. Most data mining tools handle many millions of records
> without difficulty (other than longer compute times). However, it is the
> width of the data that commonly presents the greater challenge. Now, the
> reality is that during the first few modeling iterations, we generally
> identify the most important variables, and these often number fewer than
> 100. It is this initial variable reduction that is nontrivial.
>
> It has been our experience that good models can almost always be built
> on a 20% sample of records, and these models generally include fewer
> than 50 key variables. However, it should never be the tools that force
> you to sample records or reduce your independent variables; it should be
> good methodology, with the support of tools, that helps you arrive at
> efficient and effective modeling.
>
> Ken Collier
> Senior Manager, Business Intelligence
> KPMG Consulting
> Corporate Sponsor of the Center for Data Insight
> http://insight.cse.nau.edu
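[Archive editor's note: the two techniques discussed in this thread, combining the field selections from multiple rulesets to narrow a wide dataset, and combining models by voting, can be sketched outside any particular tool. Below is a minimal illustration using scikit-learn decision trees as a stand-in for C5.0; the dataset, parameter settings, and voting scheme are all assumptions for demonstration, not Clementine or SAS EM features.]

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A wide synthetic dataset: 1000 features, only a few of them informative.
X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=10, random_state=0)

# --- Filtering idea: train several trees on the full wide dataset with
# --- different parameter settings, then keep the union of the features
# --- the trees actually split on (the "combine field selections" step).
settings = [dict(max_depth=3), dict(max_depth=5), dict(min_samples_leaf=20)]
selected = set()
for params in settings:
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    # Features with nonzero importance are the ones the tree split on.
    selected |= set(np.flatnonzero(tree.feature_importances_))

print(f"kept {len(selected)} of {X.shape[1]} features")
X_narrow = X[:, sorted(selected)]

# --- Combination idea: bagging-style ensemble on the narrowed data,
# --- combining several trees by majority vote.
rng = np.random.default_rng(0)
models = []
for _ in range(11):                       # odd count avoids vote ties
    idx = rng.integers(0, len(X_narrow), len(X_narrow))  # bootstrap sample
    models.append(DecisionTreeClassifier(random_state=0)
                  .fit(X_narrow[idx], y[idx]))

votes = np.stack([m.predict(X_narrow) for m in models])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote
print(f"ensemble training accuracy: {(ensemble_pred == y).mean():.2f}")
```

The feature-union step typically shrinks the 1000 columns to a few dozen, which mirrors the thread's observation that good models usually involve fewer than 100 key variables.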