RE: DM: RE: Data Forms for Mining
From: Khabaza, Tom
Date: Thu, 25 May 2000 13:49:02 +0100

Interesting response, Ken. I'm slightly surprised you had so much trouble with the wide dataset you describe in your earlier response. As I'm sure you know, Clementine has been coping with datasets of thousands of columns since the DTOX project in '96, which described molecules using about 7700 features.

As you rightly point out, the problems are not technical, as any reasonable tool should cope with wide datasets. The problems are methodological: wide datasets are unwieldy, and how does the human user of data mining results understand the models when they refer to thousands of variables?

Of course there is no single answer that will cover all cases. In DTOX we reduced the width of the data using Clementine's filter operations to combine the field selections from multiple C5.0 rulesets. (It sounds like this might have been appropriate technology for your widget example also.) In other situations, such as basket analysis or web mining, this might not be appropriate, and you may have to live with the wide datasets for some purposes. It's up to the tool designers to make this as painless as possible.

--
Tom Khabaza
Programme Manager, Data Mining
SPSS Services
+44 1483 719304
tomk@spss.com

> -----Original Message-----
> From: Collier, Ken [mailto:kencollier@kpmg.com]
> Sent: Wednesday, May 24, 2000 6:43 PM
> To: datamine-l@nautilus-sys.com
> Subject: RE: DM: RE: Data Forms for Mining
>
> Yes. That was 700 "vars" not "chars". Sorry Eric, but for reasons of
> diplomacy, I'm not going to name names of commercial DM tools that we've
> had trouble getting to handle more than 700 vars.
>
> We routinely have client data from a variety of industries that exceeds
> 1000 variables. Most data mining tools handle many millions of records
> without difficulty (other than longer compute times). However, it is the
> width of the data that commonly presents greater challenges. Now, the
> reality is that during the first few modeling iterations, we generally
> identify the most important variables, and these often number fewer
> than 100. It is this initial variable reduction that is nontrivial.
>
> It has been our experience that good models can almost always be built
> on a 20% sample of records, and these models generally include fewer
> than 50 key variables. However, it should never be the tools that force
> you to sample records or reduce your independent variables; it should be
> good methodology, with the support of tools, that helps you arrive at
> efficient/effective modeling.
>
> Ken Collier
> Senior Manager, Business Intelligence
> KPMG Consulting
> Corporate Sponsor of the Center for Data Insight
> http://insight.cse.nau.edu
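[Editor's note: the width-reduction approach Tom describes, combining the field selections from multiple rulesets and filtering the data down to that union, can be sketched in a few lines. The snippet below is a minimal illustration only, not Clementine or C5.0: it assumes a pandas DataFrame named `df` with a numeric target column `target`, and uses scikit-learn's DecisionTreeClassifier as a stand-in for the C5.0 rulesets, and a 20% record sample per model as in Ken's note.]

    # Illustrative sketch: train several tree models on sampled records,
    # take the union of the fields they actually split on, and filter
    # the wide dataset down to those fields. Names `df` and `target`
    # are assumed for the example.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def reduce_width(df: pd.DataFrame, target: str,
                     n_models: int = 5, sample_frac: float = 0.2) -> pd.DataFrame:
        x_cols = [c for c in df.columns if c != target]
        used = set()
        for seed in range(n_models):
            sample = df.sample(frac=sample_frac, random_state=seed)
            tree = DecisionTreeClassifier(max_depth=8, random_state=seed)
            tree.fit(sample[x_cols], sample[target])
            # Keep any field this model actually used in a split.
            used.update(col for col, imp in zip(x_cols, tree.feature_importances_)
                        if imp > 0)
        # Return the narrowed dataset: selected fields plus the target.
        return df[sorted(used) + [target]]

The point of the sketch is the filtering step at the end, not the choice of model: any ruleset or tree learner that reports which fields it used could play the same role.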