RE: DM: RE: Data Forms for Mining

From: Collier, Ken
Date: Wed, 24 May 2000 13:42:51 -0400

Yes. That was 700 "vars" not "chars". Sorry Eric, but for reasons of
diplomacy, I'm not going to name names of commercial DM tools that we've had
trouble getting to handle more than 700 vars.

We routinely have client data from a variety of industries that exceeds 1000
variables. Most data mining tools handle many millions of records without
difficulty (other than longer compute times). However, it is the width of
the data that commonly presents greater challenges. Now, the reality is that
during the first few modeling iterations, we generally identify the most
important variables and these often number fewer than 100. It is this initial
variable reduction that is nontrivial.

It has been our experience that good models can almost always be built on a
20% sample of records and these models generally include fewer than 50 key
variables. However, it should never be the tools that force you to sample
records or reduce your independent variables; it should be good methodology,
with the support of tools, that helps you arrive at efficient and effective
modeling.
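
The workflow described above (sample the records, then cut the variable list
before serious modeling) can be sketched roughly as follows. This is only an
illustration, not the method used on the engagements mentioned: the function
names pearson and top_variables, the choice of absolute Pearson correlation as
the ranking score, and the defaults (20% sample, keep 50) are assumptions
picked to mirror the figures in this message.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Returns 0.0 when either list is constant, so useless columns rank last.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def top_variables(rows, target, fraction=0.2, keep=50, seed=0):
    """Rank variables by |correlation| with the target on a random sample.

    rows is a list of dicts (one dict per record, all numeric values).
    The scoring rule is a hypothetical stand-in for a real screening step.
    """
    rng = random.Random(seed)
    k = max(2, int(len(rows) * fraction))
    sample = rng.sample(rows, min(k, len(rows)))
    ys = [r[target] for r in sample]
    scores = {}
    for name in sample[0]:
        if name == target:
            continue
        scores[name] = abs(pearson([r[name] for r in sample], ys))
    # Highest-scoring variables first, truncated to the requested count.
    return sorted(scores, key=scores.get, reverse=True)[:keep]
```

A univariate screen like this ignores interactions between variables, so it
is at best a first pass; the point is only that the sampling and the cut are
choices made by the analyst rather than limits imposed by the tool.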

Ken Collier
Senior Manager, Business Intelligence
KPMG Consulting
Corporate Sponsor of the Center for Data Insight http://insight.cse.nau.edu

-----Original Message-----
From: greg.della-croce@marchfirst.com
[mailto:greg.della-croce@marchfirst.com]
Sent: Tuesday, May 23, 2000 6:01 AM
To: datamine-l@nautilus-sys.com
Subject: Re: DM: RE: Data Forms for Mining



Eric,
       I think you misread the line. It is 700 variables, not characters.
But that does bring up an interesting question. If DM uses a 0NF to lay
out data (all the info for any given instance in one record), what are the
practical limits of the tools today on the number of variables in that
record? I am currently working in the biomed field, and that could very
easily go into the hundreds of variables per instance.
       Anyone have any input on this?

Greg




Eric Bloedorn <bloedorn@mitre.org> on 05/22/2000 12:17:37 PM

Please respond to datamine-l@nautilus-sys.com

To:   datamine-l@nautilus-sys.com
cc:    (bcc: Greg Della-Croce/Whittman-Hart LP)
Subject:  Re: DM: RE: Data Forms for Mining





Ken: I am curious - what commercial tools choke and die on tables wider
than 700 chars?!

-Eric Bloedorn,
MITRE Corporation

"Collier, Ken" wrote:
   >
   > This question hit a hot button for me. While most OLAP and DSS
   > technologies require a great deal of structure in their data (e.g.,
   > star schema), data mining tools expect the data to be denormalized
   > into a single 2D table. Furthermore, aside from market basket
   > analysis, most data mining algorithms assume that each observation
   > in a data set represents a unique entity (e.g., each record is a
   > different customer).
   >
   > What this implies is that substantial data preprocessing is required
   > in most cases to transform data from a relational, star, or other
   > structured model into the mineable denormalized structure. In our
   > experience with retailers, telcos, manufacturers, insurance
   > companies, banks, and others, this preprocessing generally consumes
   > about 80% of the total effort, compared to the actual data mining,
   > validation, verification, and deployment, which consume the
   > remaining 20%. Your mileage may vary.
   >
   > Now, here's the rub: We recently had a manufacturing client with
   > ~1000 quality control parameters for each component within a single
   > widget. In this scenario a widget is made up of 2-6 major
   > sub-widgets, and each sub-widget is made up of 3 components. The same
   > set of QC parameters is collected on each component. So, even when we
   > denormalize the data into a single table, there can be as many as 18
   > (6 x 3) records for a single widget. Our objective in this analysis
   > was to identify root causes of widget failure in order to reduce the
   > defect rate.
   >
   > Now, we want the data mining algorithms to "see" all 18 records
   > associated w/ a single widget as a single "pattern". Unfortunately
   > commercial tools don't tune their algorithms to do this even though
   > it is technically possible. One exception is time series and sequence
   > analysis algorithms, but these methods are really intended for a
   > different purpose. Another kludgy solution to this problem is to
   > string out all 18 records into a single WIDE record per widget. Many
   > commercial tools choke and die on tables that are wider than 700
   > vars.
   >
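
The "string out all 18 records into a single WIDE record" step quoted above
might look something like the sketch below. It is an illustration only, not
the transformation actually used for that client: the function name widen and
the field names widget_id, sub_widget, component, and qc1 are hypothetical.
With ~1000 QC parameters per component and up to 18 components, the result
would be on the order of 18,000 columns per widget, far past the 700-variable
mark mentioned in the thread.

```python
from collections import defaultdict


def widen(records):
    """Collapse per-component records into one wide record per widget.

    Each input record is a dict with hypothetical keys 'widget_id',
    'sub_widget', and 'component', plus any number of QC parameters.
    Every QC parameter becomes a column named s<sub>_c<comp>_<param>.
    """
    id_fields = ("widget_id", "sub_widget", "component")
    wide = defaultdict(dict)
    for rec in records:
        prefix = f"s{rec['sub_widget']}_c{rec['component']}"
        for name, value in rec.items():
            if name in id_fields:
                continue
            wide[rec["widget_id"]][f"{prefix}_{name}"] = value
    # One output row per widget, in order of first appearance.
    return [{"widget_id": wid, **cols} for wid, cols in wide.items()]


records = [
    {"widget_id": 1, "sub_widget": 1, "component": 1, "qc1": 0.5},
    {"widget_id": 1, "sub_widget": 1, "component": 2, "qc1": 0.7},
    {"widget_id": 2, "sub_widget": 1, "component": 1, "qc1": 0.9},
]
wide_rows = widen(records)
```

Widgets with fewer than six sub-widgets simply lack the corresponding
columns, so a real pipeline would also have to decide how to represent those
gaps before handing the table to a mining tool.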
   > We finally wound up solving this problem using SAS Enterprise Miner
   > and SGI Mineset, but not without a lot of data transformations,
   > preprocessing, and preliminary variable reduction. To my thinking,
   > the next generation of data mining tools should provide the
   > flexibility to "see" data in a wide variety of structures. The price
   > we may pay for this flexibility is the speed of data sourcing prior
   > to analysis.
   > ---
   > Ken Collier
   > Senior Manager, Business Intelligence
   > KPMG Consulting
   > Corporate Sponsor of the Center for Data Insight
   > http://insight.cse.nau.edu
   >
   > -----Original Message-----
   > From: greg.della-croce@marchfirst.com
   > [mailto:greg.della-croce@marchfirst.com]
   > Sent: Thursday, May 18, 2000 6:09 AM
   > To: datamine-l@nautilus-sys.com
   > Subject: DM: Data Forms for Mining
   >
   > I have worked in and around Data Warehouses/Marts with their star
   > schemas for a while now. However, I am interested in what form the
   > data takes when it is being optimized for mining. I am speaking of
   > structured data, not unstructured data such as large bodies of text.
   > What are the architectures of a Data Mining DB? Is the form dependent
   > on the algorithms that you are going to employ against it? Or is it
   > more general in nature?
   >
   > Thank you for your replies!
   >
   > Greg Della-Croce
   > marchFirst
   > BI/KM
   >
   >
   > *************************************************************************
   > The information in this email is confidential and may be legally
   > privileged. It is intended solely for the addressee. Access to this
   > email by anyone else is unauthorized.
   >
   > If you are not the intended recipient, any disclosure, copying,
   > distribution or any action taken or omitted to be taken in reliance
   > on it, is prohibited and may be unlawful. When addressed to our
   > clients any opinions or advice contained in this email are subject to
   > the terms and conditions expressed in the governing KPMG client
   > engagement letter.
   > *************************************************************************





