Re: DM: Data Mining Definition & UEA MSc & CHAID question

From: Tony Bagnall
Date: Thu, 23 Mar 2000 18:11:51 +0000

I'm sorry, I missed the initial reference to the knowledge extraction MSc, but I'll happily try to clarify the course content and give my opinions on what data mining is (which may differ from those of the rest of the research group!).

At 17:29 22/03/00 -0800, you wrote:
>Somebody posted the URL for an MSc in knowledge extraction from U East
>Anglia the other day. The web site says, "extracting hidden knowledge from
>larger data bases". Would this MSc in knowledge extraction be necessarily
>different from an MSc in data mining?

I've always thought "data mining" was a misnomer. We don't actually mine FOR data in the way that you mine for gold or coal. Generally we have plenty of data, and we mine for patterns or knowledge in that data. Personally, I would say that "knowledge extraction" is closer to a true description of what we are trying to achieve.

>If we go with such a broad term then data mining/knowledge extraction
>becomes synonymous with machine learning, does it not? Would an MSc in
>machine learning then be the same as an MSc in data extraction?

I'm not responsible for the course, but I know that it is definitely not equivalent to an MSc in machine learning, primarily because of the major statistical element of the course. We attempt to present the material from two sides, the statistical approach to exploratory data analysis and the machine learning approach to data mining, and we try to highlight areas of obvious crossover (well, I do: I gave some lectures on CHAID and Bayesian networks last semester). Most (but not all) of the students are working in industry, particularly in insurance, and their employers tend to be keen on a high stats content. I think it presents a much rounder picture of the possible ways of approaching a problem than a pure machine learning course would, but I would say that, wouldn't I (see my job description below).
Essentially, I view the exploratory data analysis techniques used in statistics as attempting to achieve pretty much the same tasks as the machine learning methods used in data mining.

The course web page is at http://www.sys.uea.ac.uk/PGStudy/mscke.html and the data mining group's page is at http://www.sys.uea.ac.uk/kdd. There is a glossy brochure we can send you. Feel free to mail me with any queries about our courses or our research, or you can go straight to the top, to Professor Rayward-Smith (vjrs@sys.uea.ac.uk).

Tony Bagnall
Lecturer in Statistics for Data Mining
School of Information Systems / School of Mathematics
University of East Anglia
http://www.sys.uea.ac.uk/~ajb/

By the way, as I'm posting: I sent a CHAID query to the list some time ago but got no response. It was a bit involved, but I'll post it again just in case.

Hi, I was hoping someone could answer a few related questions about CHAID and KS.

1. CHAID: if the final groupings from two or more predictors are found to be significant, how does CHAID choose between them? I couldn't extract this information from the Kass 1980 paper (although it may be there). My guess would be that it chooses the predictor with the lowest p-value, but this isn't completely obvious. If, for example, one predictor has a p-value of 0.0004 and a Bonferroni-adjusted threshold of 0.02, is it worse than a predictor with a p-value of 0.00035 and a Bonferroni-adjusted threshold of 0.0004? Or should the p-values be adjusted by the Bonferroni multiplier as well as the significance levels?

2. KS: Again, the question concerns how to choose between predictors for which a significant category grouping has been found. The Biggs et al. 1991 paper says "the significance level at which the 'best' k-way split of the 'best' variable should be tested is ...", which implies to me that the predictors are ranked by their p-values and then a level is calculated.
However, comments in the manual made me doubt this and my initial assumption about CHAID. Page 166 of the manual says "as the predictor variables are ranked according to their significance level, it is important that the calculated levels not favour one variable over another", which seems to be saying that KS takes the (significant) predictor with the lowest upper bound as calculated by alpha/(N_Bv*N_Bc). (E.g. if P_1 has a p-value of 0.00035 and a Bonferroni-adjusted threshold of 0.02, and P_2 a p-value of 0.009 and a Bonferroni-adjusted threshold of 0.01, choose P_2.)

3. Another KS question, about the adjusters. When in cluster mode, does KS use the CHAID Bonferroni adjusters (as implied in the manual) or the KS adjuster? If it uses the CHAID adjusters, does anyone know why (especially after the manual spends pages explaining how they can favour monotonic predictors, etc.)?

Thanks very much for any help. I apologise if the explanation is actually staring me in the face.

Tony Bagnall (bogged down in the detail)
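To make the ambiguity in questions 1 and 2 concrete, here is a small Python sketch of the three candidate selection rules discussed above, applied to the example numbers from the questions. This is purely my own illustration, not code from any CHAID or KS implementation; the global alpha of 0.05 and the form of the "Bonferroni-adjusted p-value" (raw p times alpha/threshold) are assumptions made for the sake of the example.

```python
# Each candidate predictor: (name, raw p-value, Bonferroni-adjusted threshold).
# Numbers are the P_1/P_2 example from question 2 above.
ALPHA = 0.05  # assumed global significance level
predictors = [
    ("P_1", 0.00035, 0.02),
    ("P_2", 0.009, 0.01),
]

def lowest_raw_p(preds):
    """Rule A: pick the predictor with the smallest raw p-value."""
    return min(preds, key=lambda p: p[1])[0]

def lowest_adjusted_p(preds):
    """Rule B: multiply each p-value by its Bonferroni multiplier
    (alpha / threshold, i.e. the number of comparisons) and pick the
    smallest adjusted value."""
    return min(preds, key=lambda p: p[1] * (ALPHA / p[2]))[0]

def lowest_threshold(preds):
    """Rule C (the manual's apparent rule): among predictors that are
    significant at their own adjusted threshold, pick the one with the
    lowest upper bound alpha/(N_Bv*N_Bc)."""
    significant = [p for p in preds if p[1] < p[2]]
    return min(significant, key=lambda p: p[2])[0]

print(lowest_raw_p(predictors))       # P_1
print(lowest_adjusted_p(predictors))  # P_1
print(lowest_threshold(predictors))   # P_2
```

The point of the sketch is simply that the three readings disagree on this example: rules A and B both pick P_1, while rule C, which the manual's wording seems to describe, picks P_2.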