Re: DM: Decision Trees
From: David Jensen
Date: Fri, 5 Dec 1997 11:38:35 -0500 (EST)

Patrick Canarelli writes that previous discussions on this list concluded that: 1) the three families of decision tree algorithms (CART®, ML, and AID) differ in only small ways; and 2) all lead to similar results.

In a recent paper, Tim Oates and I show that different pruning approaches (those from the CART® family and those from the ML family) can lead to radically different behavior in how tree size varies with training set size. Ideally, training set size should not affect the size of the tree, particularly after some "correct" sized tree is obtained. However, our experiments show that methods from the ML family produce a striking linear relationship between training set size and tree size, even after the accuracy of those trees has ceased to increase. CART®'s approach to pruning is much less likely to exhibit this behavior.

In a related paper, Oates, Cohen, and I summarize some of our own research on likely causes and show how TBA -- an algorithm somewhat similar to CHAID -- also avoids this pathology.

Methods from the CART®, ML, and AID families may produce models with roughly equivalent accuracy, but the complexity of models from the ML family of algorithms can be much higher. See below for details on the papers.

David Jensen
-------------------------------------------------------------------
Department of Computer Science     jensen@cs.umass.edu
Box 34610 LGRC                     http://eksl-www.cs.umass.edu/~jensen/
University of Massachusetts        Voice: 413-545-9677
Amherst, MA 01003-4610             Fax: 413-545-1249
-------------------------------------------------------------------

Tim Oates and David Jensen, "The Effects of Training Set Size on Decision Tree Complexity." Proceedings of the Fourteenth International Conference on Machine Learning, 1997, pp. 254-262.
http://eksl-www.cs.umass.edu/papers/oatesML97.ps

Abstract: This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances. This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into two parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard. We perform this decomposition for one recent data reduction technique, John's Robust-C4.5, and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set. We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared. Finally, we examine one possible cause of the pathological relationship between tree size and training set size.

David Jensen, Tim Oates, and Paul R. Cohen, "Building Simple Models: A Case Study with Decision Trees." Proceedings of the Second International Symposium on Intelligent Data Analysis, July 1997.
http://eksl-www.cs.umass.edu/papers/jensenida97.ps

Abstract: Building correctly-sized models is a central challenge for induction algorithms. Many approaches to decision tree induction fail this challenge.
Under a broad range of circumstances, these approaches exhibit a nearly linear relationship between training set size and tree size, even after accuracy has ceased to increase. These algorithms fail to adjust for the statistical effects of comparing multiple subtrees. Adjusting for these effects produces trees with little or no excess structure.
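For readers who want to see the basic shape of such an experiment, here is a minimal sketch. It is not the authors' code and does not reimplement C4.5, CART, or TBA; it simply uses scikit-learn's DecisionTreeClassifier (an assumption on my part, as a stand-in learner) and an arbitrary built-in dataset to show the general design: draw randomly selected training subsets of increasing size, fit a tree on each, and record tree size alongside test accuracy, with cost-complexity pruning (ccp_alpha=0.01, an illustrative value) standing in for a CART-style penalty on tree size.

# Sketch of a tree-size vs. training-set-size experiment.
# Assumptions: scikit-learn trees as stand-ins for the ML/CART-family
# learners in the papers; breast-cancer data and ccp_alpha=0.01 are
# arbitrary illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
fractions = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]

print("frac  n_train  leaves(unpruned)  acc(unpruned)  leaves(ccp)  acc(ccp)")
for frac in fractions:
    # Random data reduction: keep a random subset of the training set.
    n = int(frac * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    Xs, ys = X_train[idx], y_train[idx]

    # Unpruned tree: tree size is free to track training set size.
    unpruned = DecisionTreeClassifier(random_state=0).fit(Xs, ys)

    # Cost-complexity pruning (CART-style): trades training error
    # against a penalty on the number of leaves.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(Xs, ys)

    print(f"{frac:4.1f}  {n:7d}  {unpruned.get_n_leaves():16d}  "
          f"{unpruned.score(X_test, y_test):13.3f}  "
          f"{pruned.get_n_leaves():11d}  {pruned.score(X_test, y_test):8.3f}")

The thing to look for in such a table is the pattern the papers describe: the unpruned tree's leaf count tends to keep growing roughly with the number of training instances even after test accuracy has flattened, while a pruning method that explicitly penalizes size keeps the tree small. Whether a given learner shows the pathology is, of course, exactly what the cited experiments measure.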