RE: DM: RE: Data Forms for Mining (Limit on variables)


From: Helberg, Clay
Date: Wed, 24 May 2000 11:18:56 -0500
Conceptually, it is a nice goal to automate as much of the data mining
process as possible. However, a great deal of the work that goes into data
preparation relies on domain knowledge rather than statistical
characteristics of the data itself. I'm not sure how practical it would be
to automate this--you'd need to have a very specific implementation for each
domain area, and the implementation would have to be further refined for the
data "environment", e.g. different banks will have different field names for
the same basic information, and may have slightly different methods for
calculating or recording the values. A "smart" data prep engine would need
to take all that into account. The implication is that such a thing would
most likely need to be custom-built for each application.
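
To make the bank example concrete, here is a minimal sketch of the kind of
per-source configuration a "smart" prep engine would have to carry. Every
name in it (the bank keys, the field names, the cents-vs-dollars convention)
is invented for illustration and does not come from any real system.

```python
# Hypothetical per-source configuration: all names below are made up.
SOURCE_CONFIG = {
    "bank_a": {
        # raw field name -> canonical field name
        "fields": {"cust_id": "customer_id", "bal": "balance"},
        # bank A is assumed to record balances in dollars already
        "transforms": {"balance": float},
    },
    "bank_b": {
        "fields": {"CustomerNo": "customer_id", "BalanceCents": "balance"},
        # bank B is assumed to record balances in cents
        "transforms": {"balance": lambda cents: int(cents) / 100},
    },
}


def normalize(record, source):
    """Map one raw record onto canonical field names and units."""
    cfg = SOURCE_CONFIG[source]
    out = {}
    for raw_name, value in record.items():
        canonical = cfg["fields"].get(raw_name)
        if canonical is None:
            continue  # unmapped fields need a human decision; skipped here
        transform = cfg["transforms"].get(canonical, lambda v: v)
        out[canonical] = transform(value)
    return out


a = normalize({"cust_id": "17", "bal": "250.50"}, "bank_a")
b = normalize({"CustomerNo": "17", "BalanceCents": "25050"}, "bank_b")
```

Both calls yield the same canonical record, but only because someone who
knows each bank's conventions wrote the mapping by hand, which is the point.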

The alternative, using a "dumb" data prep engine that only understands
mathematical properties of the data, is unlikely to give optimal results,
and in many cases is likely to give ridiculous results. (See the statistical
literature on stepwise regression for examples of how automated field
selection can go catastrophically awry.)
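
A rough sketch of the failure mode, under simplifying assumptions: the
target and all 200 candidate predictors are independent random noise, and
only the first step of forward selection is run, with absolute Pearson
correlation as the entry criterion. The sizes and the seed are arbitrary.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def first_forward_step(columns, target):
    """Return (index, |r|) of the column most correlated with the target."""
    scores = [abs(pearson(col, target)) for col in columns]
    best = max(range(len(columns)), key=scores.__getitem__)
    return best, scores[best]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_cols = 20, 200  # few records, many candidate fields

# Target and predictors are all independent noise: nothing is truly related.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
noise = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]

best, score = first_forward_step(noise, target)
print(f"selected noise column {best} with |r| = {score:.2f}")
```

With this many candidates and this few records, the "best" field will look
respectably correlated even though it is spurious by construction; a dumb
engine has no way to know that.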

								--Clay

Clay Helberg                           http://www.execpc.com/~helberg/
SPSS Documentation                      chelberg@spss.com
Speaking only for myself....

 > -----Original Message-----
 > From: Peter van der Putten [mailto:pvdputten@smr.nl]
 > Sent: Wednesday, May 24, 2000 8:46 AM
 > To: 'datamine-l@nautilus-sys.com'
 > Subject: RE: DM: RE: Data Forms for Mining (Limit on variables)
 >
 >
 >
 > Under what assumptions?
 >
 > Most real-world datasets contain a lot of highly correlated variables,
 > so the effective dimensionality of the underlying probability
 > distribution is much, much lower. We do a lot of data mining on surveys
 > with 5k-10k attribute-value pairs and around 10k records; subselections
 > contain a lot fewer records, of course.
 >
 > In my view, algorithms should be able to cope with this kind of data,
 > because ideally you don't want to force the user to do this data
 > preparation. And for descriptive data mining it even makes sense to
 > include variables which are the same from a statistical point of view
 > but conceptually different for the user. There *is* a lot to gain from
 > data preparation, but this should be automated as much as possible
 > (forward/backward selection).
 >
 > Any reactions?
 >
 > Peter
 >
 >  > -----Original Message-----
 >  > From: Frank Buckler [mailto:buckler@m2.uni-hannover.de]
 >  > Sent: Wednesday, May 24, 2000 11:15 AM
 >  > To: datamine-l
 >  > Subject: AW: DM: RE: Data Forms for Mining (Limit on variables)
 >  >
 >  >
 >  >
 >  > Issue: Number of Inputs
 >  >
 >  > I'm surprised to hear that some people are using from thousands up to
 >  > millions of inputs.
 >  >
 >  > There is an upper bound on the number of inputs, determined by the
 >  > sample size. This follows from VC-dimension arguments.
 >  > For linear regression you need at least n+1 examples in the sample
 >  > (n = number of inputs).
 >  > The same constraint holds, and is much more severe, for non-linear
 >  > modelling!
 >  >
 >  > frank
 >  >
 >  > -----Original Message-----
 >  > From: owner-datamine-l@nautilus-sys.com
 >  > [mailto:owner-datamine-l@nautilus-sys.com] On behalf of
 >  > greg.della-croce@marchFIRST.com
 >  > Sent: Tuesday, May 23, 2000 15:01
 >  > To: datamine-l@nautilus-sys.com
 >  > Subject: Re: DM: RE: Data Forms for Mining
 >  >
 >  >
 >  > Eric,
 >  >        I think you misread the line.  It is 700 variables, not
 >  > characters.  But that does bring up an interesting question.  If DM
 >  > uses a 0NF to lay out data (all the info for any given instance in one
 >  > record), what are the practical limits of the tools, today, for the
 >  > number of variables in that record?  I am currently working in the
 >  > biomed field, and that could very easily go into the hundreds of
 >  > variables per instance.
 >  >        Anyone have any input on this?
 >  >
 >  > Greg
 >  >
 >  >
 >  >
 >  >
 >  > Eric Bloedorn <bloedorn@mitre.org> on 05/22/2000 12:17:37 PM
 >  >
 >  > Please respond to datamine-l@nautilus-sys.com
 >  >
 >  > To:   datamine-l@nautilus-sys.com
 >  > cc:    (bcc: Greg Della-Croce/Whittman-Hart LP)
 >  > Subject:  Re: DM: RE: Data Forms for Mining
 >  >
 >  >
 >  >
 >  >
 >  >
 >  > Ken: I am curious - what commercial tools choke and die on tables
 >  > wider than 700 chars?!
 >  >
 >  > -Eric Bloedorn,
 >  > MITRE Corporation
 >  >
 >  > "Collier, Ken" wrote:
 >  >    >
 >  >    > This question hit a hot button for me. While most OLAP and DSS
 >  >    > technology requires a great deal of structure in its data (e.g.,
 >  >    > star schema), data mining tools expect the data to be denormalized
 >  >    > into a single 2D table. Furthermore, aside from market basket
 >  >    > analysis, most data mining algorithms assume that each observation
 >  >    > in a data set represents a unique entity (e.g., each record is a
 >  >    > different customer).
 >  >    >
 >  >    > What this implies is that substantial data preprocessing is
 >  >    > required in most cases to transform data from a relational, star,
 >  >    > or other structured model into the mineable denormalized structure
 >  >    > required. In our experience with retailers, telcos, manufacturers,
 >  >    > insurance companies, banks, and others, this preprocessing
 >  >    > generally consumes about 80% of the total effort, compared to the
 >  >    > actual data mining, validation, verification, and deployment,
 >  >    > which consume the remaining 20%. Your mileage may vary.
 >  >    >
 >  >    > Now, here's the rub: We recently had a manufacturing client with
 >  >    > ~1000 quality control parameters for each component within a
 >  >    > single widget. In this scenario a widget is made up of 2-6 major
 >  >    > sub-widgets, and each sub-widget is made up of 3 components. The
 >  >    > same set of QC parameters is collected on each component. So, even
 >  >    > when we denormalize the data into a single table, there can be as
 >  >    > many as 18 (6 x 3) records for a single widget. Our objective in
 >  >    > this analysis was to identify root causes of widget failure in
 >  >    > order to reduce the defect rate.
 >  >    >
 >  >    > Now, we want the data mining algorithms to "see" all 18 records
 >  >    > associated w/ a single widget as a single "pattern". Unfortunately,
 >  >    > commercial tools don't tune their algorithms to do this even
 >  >    > though it is technically possible. One exception is time series
 >  >    > and sequence analysis algorithms, but these methods are really
 >  >    > intended for a different purpose. Another kludgy solution to this
 >  >    > problem is to string out all 18 records into a single WIDE record
 >  >    > per widget. Many commercial tools choke and die on tables that are
 >  >    > wider than 700 vars.
 >  >    >
 >  >    > We finally wound up solving this problem using SAS Enterprise
 >  >    > Miner and SGI Mineset, but not without a lot of data
 >  >    > transformations, preprocessing, and preliminary variable
 >  >    > reduction. To my thinking, the next generation of data mining
 >  >    > tools should provide the flexibility to "see" data in a wide
 >  >    > variety of structures. The price we may pay for this flexibility
 >  >    > is the speed of data sourcing prior to analysis.
 >  >    > ---
 >  >    > Ken Collier
 >  >    > Senior Manager, Business Intelligence
 >  >    > KPMG Consulting
 >  >    > Corporate Sponsor of the Center for Data Insight
 >  >    > http://insight.cse.nau.edu
 >  >    >
 >  >    > -----Original Message-----
 >  >    > From: greg.della-croce@marchfirst.com
 >  >    > [mailto:greg.della-croce@marchfirst.com]
 >  >    > Sent: Thursday, May 18, 2000 6:09 AM
 >  >    > To: datamine-l@nautilus-sys.com
 >  >    > Subject: DM: Data Forms for Mining
 >  >    >
 >  >    > I have worked in and around Data Warehouses/Marts with their star
 >  >    > schema for a while now.  However, I am interested in what form the
 >  >    > data takes when it is being optimized for mining.  I am speaking of
 >  >    > structured data, not unstructured data such as large bodies of
 >  >    > text.  What are the architectures of a Data Mining DB?  Is the form
 >  >    > dependent on the algorithms that you are going to employ against
 >  >    > it?  Or is it more general in nature?
 >  >    >
 >  >    > Thank you for your replies!
 >  >    >
 >  >    > Greg Della-Croce
 >  >    > marchFirst
 >  >    > BI/KM
 >  >    >
 >  >    >
 >  >
 >  >
 >  >
 >  >
 >  >
 >  >
 >
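
For reference, the "string out all 18 records into a single WIDE record"
workaround Ken describes in the quoted message can be sketched roughly as
below. The field names, the two QC parameters and the sample values are all
invented; only the 6 x 3 layout comes from his description.

```python
from collections import defaultdict

# Long format: one row per (widget, sub-widget, component).
# All names and values are made up for illustration.
long_rows = [
    {"widget": "W1", "sub": 1, "comp": 1, "qc_temp": 71.0, "qc_torque": 5.2},
    {"widget": "W1", "sub": 1, "comp": 2, "qc_temp": 69.5, "qc_torque": 5.0},
    {"widget": "W2", "sub": 1, "comp": 1, "qc_temp": 70.2, "qc_torque": 4.9},
]
QC_FIELDS = ["qc_temp", "qc_torque"]


def widen(rows, max_subs=6, comps_per_sub=3):
    """Pivot component rows into one fixed-width record per widget."""
    wide = defaultdict(dict)
    for row in rows:
        for field in QC_FIELDS:
            col = f"s{row['sub']}_c{row['comp']}_{field}"
            wide[row["widget"]][col] = row[field]
    # Fixed column order: every sub-widget slot x component x QC field.
    columns = [
        f"s{s}_c{c}_{f}"
        for s in range(1, max_subs + 1)
        for c in range(1, comps_per_sub + 1)
        for f in QC_FIELDS
    ]
    # Slots a widget does not have (fewer than 6 sub-widgets) become None.
    table = {w: [vals.get(col) for col in columns] for w, vals in wide.items()}
    return columns, table


columns, table = widen(long_rows)
```

Two QC fields already give 6 x 3 x 2 = 36 columns per widget; with roughly
1000 parameters per component the same layout would have about 18,000
columns, far past the 700-variable limit mentioned in the thread.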