RE: DM: RE: Data Forms for Mining (Limit on variables)
From: osborn
Date: Thu, 25 May 2000 11:58:02 +1000

Frank Buckler wrote:
> I'm surprised to hear that some guys are using from a thousand up to
> a million inputs.

This can be relevant in signal processing and image processing, where the data source can be unlimited (in principle). I.e., they can just keep broadcasting. Also, see below.

> There exists an upper bound on input number, determined by sample size.
> This is advocated due to the VC dimension.

Underlying VC is the precision of parametrising classification boundaries. Information in the input data AND the class exemplars provides "bits" for that precision, and ultimately probability neighbourhoods for misclassification. I.e., THIS is ALL about the complexity of the classification, not prima facie the number of inputs (except for [generalised] linear models, and with caveats for other model classes). You get prediction error from such models due to finite data sets.

> For linear regression you need n+1 examples in the sample (n = number
> of inputs).

This is sort of true. With fewer than n+1 (independent) examples, there is an irreducible set of models (all perfect fits, requiring a generalisation test [or a subjective method] to sort them out). However, you can naively deal with even this situation by, for example, imposing constraints of maximum smoothness or minimum mean-square parameters, adding noise to the data, etc. This IS naive, but it is a workaround. Given a real problem and a budget, there would be less naive 'best' models.

> This is true, but much more severe for non-linear modelling!

This isn't UNIVERSALLY the case. There are quite a few ways around it. For standard GLMs, using MLE, you need something like 2*n exemplars to avoid the irreducible set [above] (i.e., to arrive at ONE best model). I.e., cheat as for linear models.
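To make the under-determined case concrete, here is a small sketch (mine, not from the original post) in Python/NumPy. With fewer than n+1 examples, `np.linalg.lstsq` picks the minimum-norm member of the irreducible set of perfect fits, and ridge regression (penalising the mean-square parameters) is another way to break the tie; both correspond to the naive workarounds above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs = 5, 10      # fewer examples than inputs: under-determined
X = rng.normal(size=(n_samples, n_inputs))
y = rng.normal(size=n_samples)

# lstsq returns the minimum-norm solution among the infinitely
# many perfect fits -- one tie-breaking constraint.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge (minimum mean-square parameters) is another:
lam = 1e-2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_inputs), X.T @ y)

# The min-norm model still fits the sample exactly -- the problem is
# generalisation, not fit.
print(np.allclose(X @ w_min_norm, y))
```

Which tie-breaker is 'best' depends on the real problem and budget, as above.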
For hierarchical models, or with pre-processing using PCA or non-linear variants of PCA, or cleverer attempts with neural nets, RBFs or SVMs, you can fan in as much width of data as you want. [Hierarchical] Mixture of Experts can handle this well, too. Certainly Dan Steinberg's comments about CART - a tree generator - are in principle unbounded by the size of the input space. CART (etc.) only uses the most informative variables (recursively), until some criterion stops the recursion! Roughly, the complexity of a CART model is how many leaves your tree has...

The issue is whether you can generalise from the models so produced. I say it can be done, in particular situations. It depends on how complex the classification (or whatever function you're hoping to approximate) is.

T.

Dr Tom Osborn
Director of Modelling
NTF Decision Support Consultants
Level 7, 1 York Street
SYDNEY NSW 2000 AUSTRALIA
phone: +61 2 9252 0600
fax: +61 2 9251 9894
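P.S. A tiny sketch of the CART point above (my own illustration, using scikit-learn's DecisionTreeClassifier as a stand-in for CART): the input space is 100 variables wide, but the stopping criterion - not the input width - bounds the model's complexity, measured by the leaf count.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Wide input space: 100 inputs, only 5 actually informative.
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=5, random_state=0)

# The stopping criterion (here, a minimum leaf size) bounds the
# recursion; the tree splits on the most informative variables only.
tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

# Model complexity ~ number of leaves, not number of inputs.
print(tree.get_n_leaves())
```

With min_samples_leaf=10 on 200 examples, the tree can have at most 20 leaves no matter how many inputs you feed it.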