
Re: DM: Datamining Definition & UEA MSc & chaid question


From: Tony Bagnall
Date: Thu, 23 Mar 2000 18:11:51 +0000
I'm sorry, I missed the initial reference to the knowledge extraction MSc,
but I'll happily try to clarify the course content and give my opinions on
what data mining is (which may differ from the rest of the research group's!).


At 17:29 22/03/00 -0800, you wrote:
>Somebody posted the URL for an MSc in knowledge extraction from U East
>Anglia the other day. The web site says, "extracting hidden knowledge from
>larger data bases". Would this MSc in knowledge extraction be necessarily
>different from an MSc in data mining?

I've always thought data mining was a misnomer. We don't actually mine FOR
data in the way that you mine for gold or coal; generally we have plenty of
data, and we mine for patterns/knowledge in that data. Personally, I would
say that knowledge extraction is closer to a true description of what we are
trying to achieve.

>If we go with such a broad term then data mining/knowledge extraction
>becomes synonymous with machine learning, does it not? Would an MSc in
>machine learning then be the same as an MSc in data extraction?

I'm not responsible for the course, but I know that it is definitely not
equivalent to an MSc in machine learning, primarily because of the major
statistical element of the course. We attempt to present the material from
two sides: the statistical approach to exploratory data analysis and the
machine learning approach to data mining, and we try to highlight areas of
obvious crossover (well, I do: I gave some lectures on CHAID and Bayesian
networks last semester). Most (but not all) of the students are working in
industry, particularly in insurance, and their bosses tend to be keen on a
high stats content. I think it presents a much more rounded picture of the
possible ways of approaching a problem than a pure machine learning course
would, but I would say that, wouldn't I (see my job description below).
Essentially, I view the exploratory data analysis techniques used in stats
as attempting to achieve much the same tasks as the machine learning methods
used in data mining.


The course web page is at
http://www.sys.uea.ac.uk/PGStudy/mscke.html
and the data mining group's page is at
http://www.sys.uea.ac.uk/kdd
There is a glossy brochure we can send you. Feel free to mail me with any
queries about our courses or our research, or you can go straight to the
top, to Professor Rayward-Smith (vjrs@sys.uea.ac.uk).

Tony Bagnall
Lecturer in Statistics for Data Mining
School of Information Systems/ School of Mathematics
University of East Anglia 

http://www.sys.uea.ac.uk/~ajb/

btw, as I'm posting: I sent a CHAID query to the list some time ago but got
no response. It was a bit involved, but I'll post it again just in case.

hi,

I was hoping someone could answer a few related questions about CHAID and KS.

1. CHAID: if the final groupings from two or more predictors are found to be
significant, how does CHAID choose between them? I couldn't extract this
info from the Kass 1980 paper (although it may be there). My guess would be
that it chooses the predictor with the lowest p value, but this isn't
completely obvious: if, for example, one predictor has a p of 0.0004 and a
Bonferroni-adjusted threshold of 0.02, is it worse than a predictor with a p
of 0.00035 and a Bonferroni-adjusted threshold of 0.0004? Or should the p
values be adjusted by the Bonferroni multiplier as well as the significance
levels?
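
To make the ambiguity concrete, here is a toy Python sketch of the two
readings. The overall alpha of 0.05, and recovering each predictor's
Bonferroni multiplier as alpha divided by its adjusted threshold, are my
assumptions purely for illustration; the numbers are the hypothetical ones
above.

    # Two possible rules for picking between significant CHAID predictors.
    # Assumption (mine): overall significance level alpha = 0.05, and each
    # predictor's Bonferroni multiplier is alpha / (adjusted threshold).
    alpha = 0.05

    # (raw p value, Bonferroni-adjusted threshold) for the example above
    predictors = {"A": (0.0004, 0.02), "B": (0.00035, 0.0004)}

    def adjusted_p(p, threshold):
        """Bonferroni-adjust a raw p value: p times the implied multiplier."""
        multiplier = alpha / threshold  # implied number of comparisons
        return p * multiplier

    # Rule 1: compare raw p values -> B wins (0.00035 < 0.0004).
    by_raw_p = min(predictors, key=lambda k: predictors[k][0])

    # Rule 2: compare Bonferroni-adjusted p values
    # -> A wins (0.0004 * 2.5 = 0.001 vs 0.00035 * 125 = 0.04375).
    by_adjusted_p = min(predictors, key=lambda k: adjusted_p(*predictors[k]))

    print(by_raw_p, by_adjusted_p)  # B A

The two rules pick different predictors, which is exactly why I'd like to
know which one CHAID actually applies.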


2. KS: again the question relates to how to choose between predictors for
which a significant category grouping has been found. The Biggs et al 1991
paper says
"the significance level at which the 'best' k-way split of the 'best'
variable should be tested is ..."
which implies to me that the predictors are ranked by their p values and
then a level is calculated. However, comments in the manual made me doubt
this, and my initial assumption about CHAID. Page 166 of the manual says
"as the predictor variables are ranked according to their significance
level, it is important that the calculated levels not favour one variable
over another"
which seems to be saying that KS takes the (significant) predictor with the
lowest upper bound as calculated by alpha/(N_Bv * N_Bc)
(e.g. P_1 has a p of 0.00035 and a Bonferroni-adjusted threshold of 0.02,
P_2 a p of 0.009 and a Bonferroni-adjusted threshold of 0.01: choose P_2).
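
In the same sketch form (using just the p values and thresholds as given),
the two readings pick different winners for this example:

    # The P_1/P_2 example: (raw p value, Bonferroni-adjusted threshold).
    predictors = {"P_1": (0.00035, 0.02), "P_2": (0.009, 0.01)}

    # Both predictors are significant: each raw p is below its own threshold.
    significant = {k: v for k, v in predictors.items() if v[0] < v[1]}

    # Reading 1 (my reading of Biggs et al): rank by raw p value -> P_1 wins.
    by_raw_p = min(significant, key=lambda k: significant[k][0])

    # Reading 2 (my reading of page 166): take the significant predictor with
    # the lowest adjusted threshold alpha/(N_Bv * N_Bc) -> P_2 (0.01 < 0.02).
    by_threshold = min(significant, key=lambda k: significant[k][1])

    print(by_raw_p, by_threshold)  # P_1 P_2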

3. Another KS question, about the adjusters. When in cluster mode, does it
use the CHAID Bonferroni adjusters (as implied in the manual) or the KS
adjuster? If it uses the CHAID adjusters, does anyone know why (especially
after the manual spends pages explaining how they can favour monotonic
predictors, etc.)?
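
For anyone puzzled by that last parenthesis: as I recall the Kass 1980
formulae, the CHAID Bonferroni multiplier for a monotonic (ordered)
predictor is much smaller than for a free (nominal) one, so monotonic
predictors are penalised less. A quick sketch (the formulae are from memory,
so check them against the paper):

    from math import comb, factorial

    def monotonic_multiplier(c, r):
        """Ways of cutting c ordered categories into r contiguous groups."""
        return comb(c - 1, r - 1)

    def free_multiplier(c, r):
        """Ways of partitioning c nominal categories into r groups
        (a Stirling number of the second kind)."""
        total = sum((-1) ** i * (r - i) ** c / (factorial(i) * factorial(r - i))
                    for i in range(r))
        return round(total)

    # e.g. 5 categories reduced to 3 groups:
    print(monotonic_multiplier(5, 3))  # 6
    print(free_multiplier(5, 3))       # 25

With the same raw p value, the monotonic predictor's adjusted p is 6p
against the free predictor's 25p, which is the favouritism the manual spends
pages on.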


Thanks very much for any help. I apologise if the explanation is actually
staring me in the face.

Tony Bagnall
(bogged down in the detail)
