
RE: AW: DM: RE: Data Forms for Mining (Limit on variables)


From: osborn
Date: Mon, 29 May 2000 11:31:41 +1000

Mark Hubey wrote about VC dimension and the necessary number of
[training] exemplars, etc:

 > I don't get this. The two phrases "the largest number h of vectors
 > that can be separated into two different classes" and "in
 > all the 2^h possible ways" is not registering in my brain. How
 > can a collection be separated into 2 classes in N ways?

2^h is the total number of mappings (i.e., all possible classifications) of
type D -> {0,1} (where D is the domain). VC dimension is about whether a
modelling framework can achieve all of these. Mark's example with
{1,2,3,4,5,6} -> {0,1} has 2^6 possible mappings. [VC dimension is also
about some other things, too, but let's not be distracted - in the DM forum,
this part of VC dimension is what's being talked about.]
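
To make that concrete, here's a small Python sketch (mine, just a throwaway
illustration on Mark's six-point domain) counting the 2^6 possible mappings
and checking how few of them a simple 1-D threshold model can reproduce:

    from itertools import product

    points = [1, 2, 3, 4, 5, 6]                 # Mark's domain D
    all_labelings = list(product([0, 1], repeat=len(points)))
    print(len(all_labelings))                   # 2^6 = 64 mappings D -> {0,1}

    def threshold_model(t, sign):
        # classify x as 1 if sign*(x - t) > 0, else 0 (a 1-D 'linear' model)
        return lambda x: int(sign * (x - t) > 0)

    achievable = set()
    for t in [0.5 + i for i in range(7)]:       # cut-points around the data
        for sign in (+1, -1):
            f = threshold_model(t, sign)
            achievable.add(tuple(f(x) for x in points))

    print(len(achievable))                      # only 12 of the 64 labelings -
                                                # thresholds can't shatter 6 points

The gap between the 64 possible mappings and the ones the model family can
actually reach is exactly what shattering (and hence VC dimension) measures.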

PAC learning theory [Valiant's "Probably Approximately Correct" stuff]
also addresses similar issues of learnability.

These both presume that learning would use the "Sherlock Holmesian"
notion of considering everything (all possible mappings) and eliminating
all those which are inconsistent with an exemplar set - i.e., an unbiased
version space where all contending models live until killed by
inconsistency.
VC dimension is useful in addressing this (in a bounds sense) for
particular model forms, while PAC is useful for addressing it by
partitioning a space of {0,1}^N -> {0,1} into 'right' and 'wrong'. BOTH
approaches are elegantly probabilistic...
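
For a toy picture of that elimination process (made-up exemplars, same
six-point domain as above):

    from itertools import product

    D = [1, 2, 3, 4, 5, 6]
    version_space = set(product([0, 1], repeat=len(D)))   # all 64 contenders

    exemplars = [(2, 0), (5, 1)]                # observed (x, label) pairs
    for x, label in exemplars:
        i = D.index(x)
        version_space = {m for m in version_space if m[i] == label}

    print(len(version_space))                   # 64 -> 32 -> 16: each exemplar
                                                # halves the surviving models

With no bias in the model space, each exemplar only pins down one coordinate
of the mapping - which is why the exemplar requirements come out so demanding.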

As noted by others, in practical situations, the mappings of interest
are FAR less numerous than 2^h, since most interesting things in
the real world have a regularity - aka a compact representation, aka
a lot of simple consistency with a set of exceptions, many of which
can likewise be easily represented. [Hence my reference to Hints.]

...AND when the VC dimension of [generalised] linear forms is being
lorded above the non-linear/neural/trees/whatever else you can invent,
PLEASE remember that there's an awful lot of mappings which linear
forms CAN'T achieve!!!
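
A quick illustration of that (my own throwaway example - the usual XOR case,
checked by a coarse grid search rather than a proof):

    import itertools

    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    xor_labels = [0, 1, 1, 0]

    def linear_classifier(w1, w2, b):
        return lambda p: int(w1 * p[0] + w2 * p[1] + b > 0)

    grid = [x / 2 for x in range(-10, 11)]      # weights/bias from -5 to 5
    found = any(
        [linear_classifier(w1, w2, b)(p) for p in points] == xor_labels
        for w1, w2, b in itertools.product(grid, repeat=3)
    )
    print(found)                                # False - no linear threshold
                                                # reproduces the XOR mapping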

Of course, there are mappings of interest [sic] which don't fall under
this umbrella, such as "describing the order of the first K couplings of
N drug-influenced planarian worms at a (worm) sex orgy" - and maybe
there's significant predictability in that situation, awaiting an eager
young ethologist data miner to discover...      ... :-)

T.

Dr Tom Osborn
Director of Modelling
NTF
Decision Support Consultants
Level 7, 1 York Street
SYDNEY NSW 2000
AUSTRALIA
phone:	+61 2 9252 0600
fax:	+61 2 9251 9894



