RE: DM: RE: Data Forms for Mining
From: Collier, Ken
Date: Fri, 26 May 2000 01:43:13 -0400

Thanks Tom. We've recently had some good successes with Clementine and I appreciate your suggestion. I'm wondering whether a data set like DTOX is as varied in data types, formats, codes, quality problems, etc. as a typical business data set containing 5-10 years of historical, often manually entered data. It may not matter, but even when our tools have successfully handled more than 1000 features, it hasn't been simple.

In your filtering suggestion, are you saying that you generate multiple C5.0 results from the entire database using different parameter settings, and then use the filtering node to isolate key features? That sounds like a variation on bundling; I'd like to know more. We are doing a lot with bundling, bagging, and boosting to improve our predictive accuracy. However, as far as I know, SAS EM is the only tool that has this capability built in. I'd love to be able to combine models in Clementine. Is this possible?

Ken Collier
Senior Manager, Business Intelligence
KPMG Consulting
Corporate Sponsor of the Center for Data Insight
http://insight.cse.nau.edu

-----Original Message-----
From: Khabaza, Tom [mailto:tkhabaza@spss.com]
Sent: Thursday, May 25, 2000 5:49 AM
To: 'datamine-l@nautilus-sys.com'
Subject: RE: DM: RE: Data Forms for Mining

Interesting response, Ken. I'm slightly surprised you had so much trouble with the wide dataset you describe in your earlier response. As I'm sure you know, Clementine has been coping with datasets of thousands of columns since the DTOX project in '96, which described molecules using about 7700 features. As you rightly point out, the problems are not technical, since any reasonable tool should cope with wide datasets. The problems are methodological: wide datasets are unwieldy, and how does the human user of data mining results understand models that refer to thousands of variables?
Of course there is no single answer that will cover all cases. In DTOX we reduced the width of the data using Clementine's filter operations to combine the field selections from multiple C5.0 rulesets. (It sounds like this might have been appropriate technology for your widget example also.) In other situations, such as basket analysis or web mining, this might not be appropriate, and you may have to live with the wide datasets for some purposes. It's up to the tool designers to make this as painless as possible.

--
Tom Khabaza
Programme Manager, Data Mining
SPSS Services
+44 1483 719304
tomk@spss.com

> -----Original Message-----
> From: Collier, Ken [mailto:kencollier@kpmg.com]
> Sent: Wednesday, May 24, 2000 6:43 PM
> To: datamine-l@nautilus-sys.com
> Subject: RE: DM: RE: Data Forms for Mining
>
> Yes, that was 700 "vars", not "chars". Sorry Eric, but for reasons of
> diplomacy, I'm not going to name names of commercial DM tools that we've
> had trouble getting to handle more than 700 vars.
>
> We routinely have client data from a variety of industries that exceeds
> 1000 variables. Most data mining tools handle many millions of records
> without difficulty (other than longer compute times). However, it is the
> width of the data that commonly presents the greater challenge. Now, the
> reality is that during the first few modeling iterations, we generally
> identify the most important variables, and these often number fewer than
> 100. It is this initial variable reduction that is nontrivial.
>
> It has been our experience that good models can almost always be built
> on a 20% sample of records, and these models generally include fewer
> than 50 key variables. However, it should never be the tools that force
> you to sample records or reduce your independent variables; it should be
> good methodology, with the support of tools, that helps you arrive at
> efficient and effective modeling.
>
> Ken Collier
> Senior Manager, Business Intelligence
> KPMG Consulting
> Corporate Sponsor of the Center for Data Insight
> http://insight.cse.nau.edu
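[Archive editor's note: the two techniques discussed in this thread, combining the field selections from multiple rulesets to narrow a wide dataset, and combining models by voting, can be sketched outside any particular tool. Below is a minimal illustration using scikit-learn decision trees as a stand-in for C5.0; the dataset, parameter settings, and voting scheme are all assumptions for demonstration, not Clementine or SAS EM features.]

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A wide synthetic dataset: 1000 features, only a few of them informative.
X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=10, random_state=0)

# --- Filtering idea: train several trees on the full wide dataset with
# --- different parameter settings, then keep the union of the features
# --- the trees actually split on (the "combine field selections" step).
settings = [dict(max_depth=3), dict(max_depth=5), dict(min_samples_leaf=20)]
selected = set()
for params in settings:
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    # Features with nonzero importance are the ones the tree split on.
    selected |= set(np.flatnonzero(tree.feature_importances_))

print(f"kept {len(selected)} of {X.shape[1]} features")
X_narrow = X[:, sorted(selected)]

# --- Combination idea: bagging-style ensemble on the narrowed data,
# --- combining several trees by majority vote.
rng = np.random.default_rng(0)
models = []
for _ in range(11):                       # odd count avoids vote ties
    idx = rng.integers(0, len(X_narrow), len(X_narrow))  # bootstrap sample
    models.append(DecisionTreeClassifier(random_state=0)
                  .fit(X_narrow[idx], y[idx]))

votes = np.stack([m.predict(X_narrow) for m in models])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote
print(f"ensemble training accuracy: {(ensemble_pred == y).mean():.2f}")
```

The feature-union step typically shrinks the 1000 columns to a few dozen, which mirrors the thread's observation that good models usually involve fewer than 100 key variables.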