Re: DM: Looking for three datasets
From: Ronny Kohavi
Date: Thu Dec 2 09:41:14 1999

Jarek Sacha wrote:
>
> I am trying to locate three datasets: corral, m-of-n-3-7-10, and
> shuttle-small (3866 test, 1934 train).
>
> The first two are synthetic. The last one is probably a smaller version of
> the Statlog shuttle dataset.
>

They're all in http://www.sgi.com/Technology/mlc/db/

Note that for many datasets we provided default "train" and "test" sets,
in case you're not doing cross-validation.

Corral is an artificial example designed to show that decision trees
might pick a really bad attribute for the root. It's explained in
John, G.; Kohavi, R.; and Pfleger, K., Irrelevant features and the
subset selection problem. In Machine Learning: Proceedings of the
Eleventh International Conference, 1994, available off
http://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html
and in my thesis (off the above web page at the top).

The m-of-n-3-7-10 dataset represents the concept that at least three
of the bits numbered three to nine are set to one (bits one, two, and
ten are irrelevant). Such target concepts are common in medical domains
where a patient needs to exhibit at least m of a set of n symptoms to be
diagnosed with some disease (Spackman 1988). The most interesting things
about this concept are that Naive-Bayes is unable to learn it even though
it can be represented as a hyperplane, and that performance improves
if you hide a relevant feature (page 107 in my thesis).

-- Ronny
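
For readers who want to see the m-of-n-3-7-10 concept concretely, here is a
minimal Python sketch of the labeling rule as described in the message. It is
not the code used to build the dataset at the URL above. The names
m_of_n_label and all_instances are made up for illustration, and the encoding
(a tuple of ten 0/1 values, with bit one first) is an assumption. The rule is
a linear threshold on the sum of bits three to nine, which is why the concept
can be represented as a hyperplane.

```python
from itertools import product

# Assumptions (illustrative, not from the original dataset files):
# bits are numbered 1..10, bits[0] holds bit one, and the label is 1
# when at least M of the bits numbered 3..9 are set to one.
M = 3
RELEVANT = range(3, 10)  # bits three to nine, 1-based
N_BITS = 10


def m_of_n_label(bits):
    """Return 1 if at least M of the relevant bits are set, else 0."""
    return int(sum(bits[i - 1] for i in RELEVANT) >= M)


def all_instances():
    """Yield every 10-bit instance together with its label."""
    for bits in product((0, 1), repeat=N_BITS):
        yield bits, m_of_n_label(bits)


if __name__ == "__main__":
    labels = [label for _, label in all_instances()]
    print(len(labels), "instances,", sum(labels), "positive")
```

Bits one, two, and ten never enter the sum, so flipping them leaves the label
unchanged.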