
Re: DM: Datamining Definition & UEA MSc & chaid question


From: Paul Wilkie
Date: Fri, 24 Mar 2000 10:20:09 -0000

Personally, I think "information prospecting" is closer to the truth - we are
usually trying to find where the info is (or whether it is there at all!), and
often leave the heavy-duty mining to more mundane tools. Anyway, I think the
image of an individual panning in a river is much more apt than a helmet
and pickaxe.
----- Original Message -----
From: Tony Bagnall <ajb@sys.uea.ac.uk>
To: <datamine-l@nautilus-sys.com>
Sent: 23 March 2000 18:11
Subject: Re: DM: Datamining Definition & UEA MSc & chaid question


 > I'm sorry, I missed the initial reference to the knowledge extraction MSc,
 > but I'll happily try to clarify the course content and give my opinions on
 > what data mining is (which may differ from the rest of the research groups!)
 >
 >
 > At 17:29 22/03/00 -0800, you wrote:
 > >Somebody posted the URL for an MSc in knowledge extraction from U East
 > >Anglia the other day. The web site says, "extracting hidden knowledge from
 > >larger data bases". Would this MSc in knowledge extraction necessarily be
 > >different from an MSc in data mining?
 >
 > I've always thought data mining was a misnomer. We don't actually mine FOR
 > data in the way that you mine for gold or coal. Generally we have plenty of
 > data and we mine for patterns/knowledge in that data. I would say
 > personally that knowledge extraction is closer to the true description of
 > what we are trying to achieve.
 >
 > >If we go with such a broad term, then data mining/knowledge extraction
 > >becomes synonymous with machine learning, does it not? Would an MSc in
 > >machine learning then be the same as an MSc in data extraction?
 >
 > I'm not responsible for the course, but I know that it is definitely not
 > equivalent to an MSc in machine learning, primarily because of the major
 > statistical element of the course. We attempt to present the material from
 > two sides: the statistical approach to exploratory data analysis and the
 > machine learning approach to data mining, and we try to highlight areas of
 > obvious crossover (well, I do; I gave some lectures on CHAID and Bayesian
 > networks last semester). Most (but not all) of the students are working
 > in industry, particularly in insurance, and their bosses tend to be keen on
 > a high stats content. I think it presents a much more rounded picture of
 > the possible methods of approaching a problem than a pure machine learning
 > course would, but I would say that, wouldn't I (see my job description
 > below). Essentially I view the exploratory data analysis techniques used
 > in stats as attempting to achieve pretty much the same tasks as the
 > machine learning methods used in data mining.
 >
 >
 > The web page is at
 > http://www.sys.uea.ac.uk/PGStudy/mscke.html
 > and the data mining group's page is at
 > http://www.sys.uea.ac.uk/kdd
 > There is a glossy brochure we can send you. Feel free to mail me with any
 > queries about our courses or our research, or you can go straight to the
 > top, to Professor Rayward-Smith (vjrs@sys.uea.ac.uk).
 >
 > Tony Bagnall
 > Lecturer in Statistics for Data Mining
 > School of Information Systems/ School of Mathematics
 > University of East Anglia
 >
 > http://www.sys.uea.ac.uk/~ajb/
 >
 > btw, as I'm posting, I sent a CHAID query to the list some time ago but
 > got no response. It was a bit involved, but I'll post it again just in
 > case.
 >
 > hi,
 >
 > I was hoping someone could answer a few related questions about CHAID and
 > KS.
 >
 > 1. CHAID: if the final groupings from two or more predictors are found to
 > be significant, how does CHAID choose between them? I couldn't extract
 > this info from the Kass 1980 paper (although it may be there). My guess
 > would be that it chooses the predictor with the lowest p-value, but this
 > isn't completely obvious: if, for example, one predictor has a p of 0.0004
 > and a Bonferroni-adjusted threshold of 0.02, is it worse than a predictor
 > with a p of 0.00035 and a Bonferroni-adjusted threshold of 0.0004? Or
 > should the p-values be adjusted by the Bonferroni multiplier as well as
 > the significance levels?
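
To make the ambiguity concrete: the two readings pick different predictors
from the numbers above. A minimal Python sketch, assuming alpha = 0.05 and
that each quoted threshold is alpha divided by its Bonferroni multiplier B
(so the adjusted p is p * B); the predictor names are hypothetical:

    ALPHA = 0.05

    predictors = {
        # name: (raw p-value, Bonferroni-adjusted threshold = ALPHA / B)
        "P_a": (0.0004, 0.02),
        "P_b": (0.00035, 0.0004),
    }

    # Keep only predictors whose best grouping passes its own threshold.
    significant = {k: v for k, v in predictors.items() if v[0] < v[1]}

    # Reading 1: the threshold is only a significance gate; compare raw p-values.
    by_raw_p = min(significant, key=lambda k: significant[k][0])

    # Reading 2: Bonferroni-adjust the p-values too, then compare those.
    def adjusted_p(p, threshold):
        return p * (ALPHA / threshold)  # p * B, since B = ALPHA / threshold

    by_adjusted_p = min(significant, key=lambda k: adjusted_p(*significant[k]))

    print(by_raw_p)       # P_b: 0.00035 is the smaller raw p
    print(by_adjusted_p)  # P_a: adjusted 0.001 beats P_b's adjusted 0.04375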
 >
 >
 > 2. KS: again, the question relates to how to choose between predictors
 > for which a significant category grouping has been found. The Biggs et
 > al. 1991 paper says
 > "the significance level at which the 'best' k-way split of the 'best'
 > variable should be tested is ..."
 > which implies to me that the predictors are ranked by their p-values and
 > then a level is calculated. However, comments in the manual made me doubt
 > this, and my initial assumption about CHAID too. Page 166 of the manual
 > says
 > "as the predictor variables are ranked according to their significance
 > level, it is important that the calculated levels not favour one variable
 > over another"
 > which seems to be saying that KS takes the (significant) predictor with
 > the lowest upper bound as calculated by alpha/(N_Bv*N_Bc)
 > (e.g. P_1 has a p of 0.00035 and a Bonferroni-adjusted threshold of 0.02,
 > and P_2 a p of 0.009 and a Bonferroni-adjusted threshold of 0.01: choose
 > P_2).
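
Read that way, the rule is easy to state; a minimal Python sketch using the
P_1/P_2 numbers from the example (alpha and the N_Bv*N_Bc counts are assumed
to be folded into the quoted thresholds):

    predictors = {
        # name: (p-value of best grouping, adjusted threshold alpha/(N_Bv*N_Bc))
        "P_1": (0.00035, 0.02),
        "P_2": (0.009, 0.01),
    }

    # Keep only predictors significant at their own adjusted threshold, then
    # take the one with the lowest threshold (the tightest upper bound).
    significant = {k: v for k, v in predictors.items() if v[0] < v[1]}
    chosen = min(significant, key=lambda k: significant[k][1])

    print(chosen)  # P_2, as in the example, despite P_1's much smaller raw p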
 >
 > 3. Another KS question, about the adjusters. When in cluster mode, does
 > it use the CHAID Bonferroni adjusters (as implied in the manual) or the
 > KS adjuster? If it uses the CHAID adjusters, does anyone know why
 > (especially after the manual spends pages explaining how they can favour
 > monotonic predictors, etc.)?
 >
 >
 > Thanks very much for any help; I apologise if the explanation is actually
 > staring me in the face.
 >
 > Tony Bagnall
 > (bogged down in the detail)