Re: DM: RE: Data Forms for Mining

From: greg.della-croce
Date: Thu, 18 May 2000 14:27:44 -0500

Ken,

Thanks for the reply. From the replies so far it seems to me that 0NF or 1NF files are the standard for most tools. Your situation points to the limits this sort of design can run into. 700 sounds big at first, but if you are using 0NF files, that shrinks fast!

How are time dimensions handled? Every event or record should be date or date/time stamped, but most time dimensions will allow that to be related to Q1 or FY00 or SPRING SESSION or the like. I am guessing those just get sucked up as fields in the 0NF event?

Thanks again!

Greg Della-Croce

"Collier, Ken" <kencollier@kpmg.com> on 05/18/2000 11:14:37 AM
Please respond to datamine-l@nautilus-sys.com
To: "'datamine-l@nautilus-sys.com'" <datamine-l@nautilus-sys.com>
cc: (bcc: Greg Della-Croce/Whittman-Hart LP)
Subject: DM: RE: Data Forms for Mining

This question hit a hot button for me. While most OLAP and DSS technology requires a great deal of structure in its data (e.g., a star schema), data mining tools expect the data to be denormalized into a single 2D table. Furthermore, aside from market basket analysis, most data mining algorithms assume that each observation in a data set represents a unique entity (e.g., each record is a different customer). What this implies is that substantial data preprocessing is required in most cases to transform data from a relational, star, or other structured model into the mineable denormalized structure. In our experience with retailers, telcos, manufacturers, insurance companies, banks, and others, this preprocessing generally consumes about 80% of the total effort, compared to the actual data mining, validation, verification, and deployment, which consume the remaining 20%. Your mileage may vary.

Now, here's the rub: We recently had a manufacturing client with ~1000 quality control parameters for each component within a single widget. In this scenario a widget is made up of 2-6 major sub-widgets, and each sub-widget is made up of 3 components. The same set of QC parameters is collected on each component. So, even when we denormalize the data into a single table, there can be as many as 18 (6 x 3) records for a single widget. Our objective in this analysis was to identify root causes of widget failure in order to reduce the defect rate.

Now, we want the data mining algorithms to "see" all 18 records associated with a single widget as a single "pattern". Unfortunately, commercial tools don't tune their algorithms to do this, even though it is technically possible. One exception is time series and sequence analysis algorithms, but those methods are really intended for a different purpose. Another kludgy solution to this problem is to string out all 18 records into a single WIDE record per widget. Many commercial tools choke and die on tables that are wider than 700 variables. We finally wound up solving this problem using SAS Enterprise Miner and SGI MineSet, but not without a lot of data transformations, preprocessing, and preliminary variable reduction.

To my thinking, the next generation of data mining tools should provide the flexibility to "see" data in a wide variety of structures. The price we may pay for this flexibility is the speed of data sourcing prior to analysis.

---
Ken Collier
Senior Manager, Business Intelligence
KPMG Consulting
Corporate Sponsor of the Center for Data Insight
http://insight.cse.nau.edu
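A minimal sketch of the wide-record flattening Ken describes, assuming hypothetical column names (widget_id, component_pos, qc_temp, qc_torque) and Python/pandas rather than the SAS/MineSet tooling actually used:

import pandas as pd

# Toy component-level QC data: one row per (widget, component position).
qc = pd.DataFrame({
    "widget_id":     [1, 1, 1, 2, 2, 2],
    "component_pos": [1, 2, 3, 1, 2, 3],
    "qc_temp":       [70.1, 69.8, 71.2, 68.9, 70.5, 70.0],
    "qc_torque":     [5.1, 5.3, 4.9, 5.0, 5.2, 5.1],
})

# Pivot so every QC parameter becomes a separate column per component position,
# leaving exactly one record per widget -- the shape most mining tools expect.
wide = qc.pivot(index="widget_id", columns="component_pos",
                values=["qc_temp", "qc_torque"])
wide.columns = [f"{param}_c{pos}" for param, pos in wide.columns]
wide = wide.reset_index()
print(wide)

With ~1000 QC parameters and up to 18 component positions, this layout quickly exceeds the ~700-variable limit mentioned above, which is why preliminary variable reduction had to come first.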
-----Original Message-----
From: greg.della-croce@marchfirst.com [mailto:greg.della-croce@marchfirst.com]
Sent: Thursday, May 18, 2000 6:09 AM
To: datamine-l@nautilus-sys.com
Subject: DM: Data Forms for Mining

I have worked in and around data warehouses/marts with their star schemas for a while now. However, I am interested in what form the data takes when it is being optimized for mining. I am speaking of structured data, not unstructured data such as large bodies of text. What are the architectures of a data mining DB? Is the form dependent on the algorithms that you are going to employ against it? Or is it more general in nature?

Thank you for your replies!

Greg Della-Croce
marchFirst BI/KM
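To make the answer to the original question concrete, here is a small sketch of flattening a star schema into the single 2D table mining tools expect, with the time dimension's attributes (quarter, fiscal year, season) carried along as ordinary fields on each event record and the result rolled up to one row per customer. The table and column names are invented for illustration, not taken from the thread.

import pandas as pd

# A small star schema: one fact table plus two dimension tables.
fact = pd.DataFrame({
    "customer_id": [101, 102, 101],
    "date_key":    [20000114, 20000301, 20000517],
    "amount":      [250.0, 80.0, 125.0],
})
time_dim = pd.DataFrame({
    "date_key":    [20000114, 20000301, 20000517],
    "quarter":     ["Q1", "Q1", "Q2"],
    "fiscal_year": ["FY00", "FY00", "FY00"],
    "season":      ["WINTER", "SPRING", "SPRING"],
})
customer_dim = pd.DataFrame({
    "customer_id": [101, 102],
    "region":      ["EAST", "WEST"],
})

# Denormalize: join the dimensions onto the facts so the time attributes become
# plain fields on each event record (the "0NF" layout discussed above).
flat = fact.merge(time_dim, on="date_key").merge(customer_dim, on="customer_id")

# Then aggregate to one observation per customer, since most mining algorithms
# assume each record is a unique entity.
per_customer = (flat.groupby(["customer_id", "region"], as_index=False)
                    .agg(total_amount=("amount", "sum"),
                         quarters_active=("quarter", "nunique")))
print(per_customer)

The join-and-aggregate step is where the bulk of the 80% preprocessing effort Ken mentions tends to go in practice.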