
RE: DM: RE: Data Forms for Mining (Limit on variables)


From: osborn
Date: Thu, 25 May 2000 11:58:02 +1000
Frank Buckler wrote:

 > I'm surprised to hear that some guys are using from a thousand up to
 > a million inputs.

This can be relevant in signal processing and image processing, where
the data source can be unlimited (in principle): it can just keep
broadcasting. Also, see below.

 > There exists an upper bound on the number of inputs, determined by
 > sample size. This is advocated on the basis of the VC dimension.

Underlying the VC dimension is the precision with which classification
boundaries are parametrised. Information in the input data AND the class
exemplars provides the "bits" for that precision, and ultimately the
probability neighbourhoods for misclassification.

I.e., THIS is ALL about the complexity of the classification, not prima
facie the number of inputs (except for [generalised] linear models, and
with caveats for other model classes). Prediction error in such models
comes from finite data sets.
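
As a rough aside (this is the standard PAC-learning bound, not something
stated in the original post, so treat the constants as hedged): for a
hypothesis class of VC dimension d, achieving classification error at
most eps with probability at least 1 - delta needs on the order of

    m = O( (1/eps) * ( d*log(1/eps) + log(1/delta) ) )

examples. It is d - the effective complexity of the boundary - that
drives the sample size, not the raw count of inputs.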

 > For linear regression you need n+1 examples in the sample (n = number
 > of inputs)

This is sort of true. With fewer than n+1 (independent) examples, there
is an irreducible set of models (all perfect fits, requiring a
generalisation test [or a subjective method] to sort them out). However,
you can deal naively even with this situation, by (for example) using
constraints of maximum smoothness or minimum mean-square parameters, or
by adding noise to the data, etc. This IS naive, but it is a workaround.
Given a real problem and a budget, there would be less naive 'best'
models.
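
A minimal sketch of that naive workaround (minimum-norm or penalised
parameters when there are fewer than n+1 examples); the sizes and the
NumPy / pseudoinverse choice are my illustration, not something from the
original post:

    import numpy as np

    # Fewer examples (5) than inputs (20): the least-squares problem is
    # underdetermined, so infinitely many coefficient vectors fit exactly.
    rng = np.random.default_rng(0)
    n_inputs, n_examples = 20, 5
    X = rng.normal(size=(n_examples, n_inputs))
    y = rng.normal(size=n_examples)

    # Workaround 1: the minimum-norm solution via the pseudoinverse.
    beta_min_norm = np.linalg.pinv(X) @ y

    # Workaround 2: ridge regression (penalise the mean-square parameters).
    lam = 1e-2
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_inputs), X.T @ y)

    print(np.allclose(X @ beta_min_norm, y))  # True: perfect fit on the sample

Both simply pick one member of the irreducible set by an external
criterion rather than by the data.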

 > This is true but much more severe for non-linear modelling!

This isn't UNIVERSALLY the case. There are quite a few ways around it.
For standard GLMs fitted by MLE, you need something like 2*n exemplars
to avoid the irreducible set [above] (i.e., to arrive at ONE best
model). That is, cheat as for linear models.

For hierarchical models, pre-processing using PCA or non-linear variants
of PCA, or cleverer attempts with neural networks, RBFs or SVMs, you can
fan in as much width of data as you want. [Hierarchical] Mixtures of
Experts can handle this well, too.
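
A hedged sketch of the "pre-process with PCA, then model" route for very
wide data; the library (scikit-learn), the sizes and the toy target are
all my assumptions:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5000))             # 5000 inputs, only 200 exemplars
    y = (X[:, :3].sum(axis=1) > 0).astype(int)   # class driven by a few inputs

    # Project the wide input space down to a handful of components, then fit
    # a simple classifier on the reduced representation.
    model = make_pipeline(PCA(n_components=10), LogisticRegression())
    model.fit(X, y)
    print(model.score(X, y))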

Certainly Dan Steinberg's comments about CART - a tree generator - hold:
such models are in principle unbounded by the size of the input space.
CART (etc.) only uses the most informative variables (recursively),
until some criterion stops the recursion! Roughly, the complexity of a
CART model is how many leaves your tree has...
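
To illustrate that point (scikit-learn's DecisionTreeClassifier standing
in for CART; the data and sizes are invented):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 1000))          # 1000 candidate inputs
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only two of them matter

    # The tree splits recursively on the most informative variables and stops
    # at a complexity criterion (here, a cap on leaves), regardless of how
    # many inputs were offered.
    tree = DecisionTreeClassifier(max_leaf_nodes=8).fit(X, y)
    print("leaves:", tree.get_n_leaves())
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    print("variables actually split on:", used)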

The issue is whether you can generalise from the models so produced.
I say it can be done, in particular situations. It depends on how
complex the classification (or whatever function you're hoping to
approximate) is.

T.

Dr Tom Osborn
Director of Modelling
NTF
Decision Support Consultants
Level 7, 1 York Street
SYDNEY NSW 2000
AUSTRALIA
phone:	+61 2 9252 0600
fax:	+61 2 9251 9894
