Re: DM: RE: Data Forms for Mining


From: Eric Bloedorn
Date: Mon, 22 May 2000 13:17:37 -0400
Organization: The MITRE Corporation



Ken: I am curious - what commercial tools choke and die on tables wider
than 700 vars?!

-Eric Bloedorn,
MITRE Corporation

"Collier, Ken" wrote:
 >
 > This question hit a hot-button for me. While most OLAP and DSS technologies
 > require a great deal of structure in their data (e.g., star schema), data
 > mining tools expect the data to be denormalized into a single 2D table.
 > Furthermore, aside from market basket analysis, most data mining algorithms
 > assume that each observation in a data set represents a unique entity (e.g.,
 > each record is a different customer).
 >
 > What this implies is that there is substantial data preprocessing required
 > in most cases to transform data from a relational, star, or other structured
 > model, into the mineable denormalized structure required. In our experience
 > with retailers, telcos, manufacturers, insurance companies, banks, and
 > others, this preprocessing generally consumes about 80% of the total effort
 > compared to the actual data mining, validation, verification, and
 > deployment, which consumes the remaining 20%. Your mileage may vary.
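
As a rough illustration of the preprocessing described above, here is a minimal sketch that flattens a tiny star schema into one record per customer. The tables, column names, and the `denormalize` helper are all hypothetical and not taken from the original post; real projects would typically do this in SQL or an ETL tool.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table plus two dimension tables.
customers = {1: {"region": "West"}, 2: {"region": "East"}}
products = {10: {"category": "tools"}, 11: {"category": "toys"}}
sales = [
    {"customer_id": 1, "product_id": 10, "amount": 20.0},
    {"customer_id": 1, "product_id": 11, "amount": 5.0},
    {"customer_id": 2, "product_id": 10, "amount": 12.5},
]

def denormalize(sales, customers, products):
    """Return one flat record per customer (one row = one entity)."""
    categories = sorted({p["category"] for p in products.values()})
    # totals[customer_id][category] -> summed purchase amount
    totals = defaultdict(lambda: defaultdict(float))
    for row in sales:
        category = products[row["product_id"]]["category"]
        totals[row["customer_id"]][category] += row["amount"]
    table = []
    for cid in sorted(customers):
        record = {"customer_id": cid, "region": customers[cid]["region"]}
        for category in categories:
            # Categories a customer never bought default to 0.0.
            record["spend_" + category] = totals[cid][category]
        table.append(record)
    return table

mining_table = denormalize(sales, customers, products)
```

Each customer ends up as exactly one row, which is the "unique entity per observation" shape that most mining algorithms assume.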
 >
 > Now, here's the rub: We recently had a manufacturing client with ~1000
 > quality control parameters for each component within a single widget. In
 > this scenario a widget is made up of 2-6 major sub-widgets, and each
 > sub-widget is made up of 3 components. The same set of QC parameters is
 > collected on each component. So, even when we denormalize the data into a
 > single table, there can be as many as 18 (6 x 3) records for a single
 > widget. Our objective in this analysis was to identify root causes of widget
 > failure in order to reduce the defect rate.
 >
 > Now, we want the data mining algorithms to "see" all 18 records associated
 > w/ a single widget as a single "pattern". Unfortunately commercial tools
 > don't tune their algorithms to do this even though it is technically
 > possible. One exception is time series and sequence analysis algorithms, but
 > these methods are really intended for a different purpose. Another kludgy
 > solution to this problem is to string out all 18 records into a single WIDE
 > record per widget. Many commercial tools choke and die on tables that are
 > wider than 700 vars.
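
The "string out into a single WIDE record" workaround can be sketched as follows. This is only an illustration under assumed names: the `widen` helper, the `widget_id` / `sub_widget` / `component` keys, and the two sample parameters are hypothetical. With roughly 1000 QC parameters and up to 18 component records per widget, the same approach would yield on the order of 18,000 columns, which is why the width limit bites.

```python
def widen(records, params, max_subs=6, comps=3):
    """Collapse long-format QC rows into one wide row per widget.

    Slots for sub-widgets a widget does not have are left as None.
    """
    columns = [
        f"sub{s}_comp{c}_{p}"
        for s in range(1, max_subs + 1)
        for c in range(1, comps + 1)
        for p in params
    ]
    wide = {}
    for rec in records:
        row = wide.get(rec["widget_id"])
        if row is None:
            row = {"widget_id": rec["widget_id"]}
            row.update(dict.fromkeys(columns))  # every slot starts as None
            wide[rec["widget_id"]] = row
        for p in params:
            key = f"sub{rec['sub_widget']}_comp{rec['component']}_{p}"
            row[key] = rec[p]
    return list(wide.values())

# Hypothetical long-format data: one record per component.
long_rows = [
    {"widget_id": "A", "sub_widget": 1, "component": 1, "temp": 71.0, "torque": 3.2},
    {"widget_id": "A", "sub_widget": 1, "component": 2, "temp": 69.5, "torque": 3.1},
    {"widget_id": "B", "sub_widget": 2, "component": 3, "temp": 75.0, "torque": 2.9},
]
wide_rows = widen(long_rows, params=["temp", "torque"])
```

Even with only two parameters, each wide row here carries 6 x 3 x 2 = 36 measurement columns plus the widget identifier.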
 >
 > We finally wound up solving this problem using SAS Enterprise Miner and SGI
 > Mineset, but not without a lot of data transformations, preprocessing, and
 > preliminary variable reduction. To my thinking, the next generation of data
 > mining tools should provide the flexibility to "see" data in a wide variety
 > of structures. The price we may pay for this flexibility is the speed of
 > data sourcing prior to analysis.
 > ---
 > Ken Collier
 > Senior Manager, Business Intelligence
 > KPMG Consulting
 > Corporate Sponsor of the Center for Data Insight http://insight.cse.nau.edu
 >
 > -----Original Message-----
 > From: greg.della-croce@marchfirst.com
 > [mailto:greg.della-croce@marchfirst.com]
 > Sent: Thursday, May 18, 2000 6:09 AM
 > To: datamine-l@nautilus-sys.com
 > Subject: DM: Data Forms for Mining
 >
 > I have worked in and around Data Warehouses/Marts with their star schemas
 > for a while now. However, I am interested in what form the data takes when
 > it is being optimized for mining. I am speaking of structured data, not
 > unstructured data such as large bodies of text. What are the architectures
 > of a Data Mining DB? Is the form dependent on the algorithms that you are
 > going to employ against it? Or is it more general in nature?
 >
 > Thank you for your replies!
 >
 > Greg Della-Croce
 > marchFirst
 > BI/KM
 >
