
Re: DM: Research based on restricted-access data.


From: Alex Alves Freitas
Date: Wed Nov 17 21:21:30 1999

I believe there is a reasonable (though not perfect) solution to this problem:

The owner of the data can give researchers access to a modified version of the data. The "trick" is to rename attribute names such as Sex, Salary, etc. to generic names such as Atr1, Atr2, etc. If necessary, the same principle can be applied to attribute values as well: rather than accessing attribute Sex with values "M" or "F", the researcher accesses attribute Atr1 with values 1 or 2.
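To make the renaming concrete, here is a minimal Python sketch of the idea. The function name `anonymize`, the `recode_values` parameter, and the sample rows are all hypothetical, chosen only for illustration. It assumes every row has the same attributes and that only categorical attributes have their values recoded, since replacing a numeric attribute such as Salary with arbitrary codes would destroy its ordering.

```python
def anonymize(rows, recode_values=()):
    """Rename attributes to Atr1, Atr2, ... and recode chosen values to 1, 2, ...

    Returns (anonymized_rows, key). The key is what lets the data owner
    translate results back, so it is never published.
    """
    names = list(rows[0].keys())  # assumption: all rows share these attributes
    name_map = {name: f"Atr{i}" for i, name in enumerate(names, start=1)}
    value_maps = {name: {} for name in recode_values}

    anonymized = []
    for row in rows:
        new_row = {}
        for name in names:
            value = row[name]
            if name in value_maps:
                codes = value_maps[name]
                # Give each distinct value the next integer code on first sight.
                value = codes.setdefault(value, len(codes) + 1)
            new_row[name_map[name]] = value
        anonymized.append(new_row)

    key = {"names": name_map, "values": value_maps}
    return anonymized, key


# Hypothetical sample data, not taken from any real database.
rows = [
    {"Sex": "M", "Salary": 30000},
    {"Sex": "F", "Salary": 42000},
    {"Sex": "M", "Salary": 51000},
]
public_rows, owner_key = anonymize(rows, recode_values=["Sex"])
print(public_rows[0])  # {'Atr1': 1, 'Atr2': 30000}
```

Only `public_rows` would be released; `owner_key` stays with the owner, who can use it to read any discovered rule in terms of the original attributes.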

Hence, the discovered rules will refer to generic attribute and attribute-value names, which will make sense only to the owner of the data. However, if we want to measure the success of a data mining algorithm by its predictive accuracy rate, as many people do, we can still do so, and anyone can try to achieve a higher predictive accuracy rate on this public-domain data set.

Of course, this is not a perfect solution. We would like to discover rules that are not only accurate but also comprehensible and interesting, and to judge the comprehensibility and interestingness of a rule we must understand what it means. In addition, to do a good data mining job the researcher should understand the nature of the data very well. Still, mining this modified, public-domain, "meaningless" data set seems better than not mining the data at all! In fact, the UCI data set repository already contains one such public-domain, "meaningless" data set, namely the Australian Credit data set.

If I were the owner of a large data set, I would be happy to give this kind of modified data set to the research community and let researchers try to mine my data for free. It seems I would have almost nothing to lose (except some time spent converting the data to protect its confidentiality) and a lot to gain (free results about my data which only I would be able to interpret!). I hope data owners think a little bit about that . . .
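On the point about predictive accuracy: the sketch below illustrates, under assumptions, that accuracy can be computed on data whose attributes are only generic codes. The classifier is a deliberately simple one-attribute majority rule, used as a stand-in for any data mining algorithm. The function names, the choice of Atr3 as the class attribute, and all rows are hypothetical.

```python
from collections import Counter, defaultdict


def fit_one_rule(rows, attribute, target):
    """Learn 'if attribute == v then predict the majority class for v'."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
    # Fallback for attribute values never seen in training: overall majority class.
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    return rule, default


def accuracy(rule, default, rows, attribute, target):
    """Fraction of rows whose predicted class equals the recorded class."""
    hits = sum(rule.get(row[attribute], default) == row[target] for row in rows)
    return hits / len(rows)


# Hypothetical coded data; nobody but the owner knows what Atr1 or Atr3 mean.
train = [
    {"Atr1": 1, "Atr3": 1},
    {"Atr1": 1, "Atr3": 1},
    {"Atr1": 2, "Atr3": 2},
    {"Atr1": 2, "Atr3": 1},
    {"Atr1": 2, "Atr3": 2},
]
holdout = [
    {"Atr1": 1, "Atr3": 1},
    {"Atr1": 2, "Atr3": 2},
    {"Atr1": 2, "Atr3": 1},
    {"Atr1": 1, "Atr3": 1},
]

rule, default = fit_one_rule(train, "Atr1", "Atr3")
holdout_accuracy = accuracy(rule, default, holdout, "Atr1", "Atr3")
print(rule, holdout_accuracy)  # {1: 1, 2: 2} 0.75
```

The learned rule reads "Atr1 = 2 implies Atr3 = 2", which a researcher can score and compare with other algorithms, but only the owner can interpret.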

Regards,

Alex

=================================================
Alex A. Freitas, Ph.D.
PUC-PR (Pontificia Universidade Catolica - PR)
PPGIA - CCET
Rua Imaculada Conceicao, 1155
Curitiba - PR, 80215-901
Brasil
alex@ppgia.pucpr.br
http://www.ppgia.pucpr.br/~alex
=================================================


Richard Dybowski wrote:

> Most of the research done by the DM community involves
> real-world databases, but there is a problem for researchers
> involved with restricted-access data.
> Suppose that the owner of a database (not me) has invested a
> large amount of money to build it, and s/he, therefore, wants
> to maintain control of how the database is used. In particular,
> s/he does not want the world to have unrestricted access to it.
> This creates a problem for me, for if I publish a paper based on
> this database, and am asked by people in the research community
> for copies of the database in order for them to verify my
> published results, I will have to refuse. But this refusal would
> place me in a very embarrassing position, and it could
> considerably reduce the value of my research.
>
> This is not a hypothetical scenario. I have been presented 
> with the opportunity to analyze such a database; therefore, 
> I would welcome any suggestions from those who have been 
> confronted with a similar situation.
> Regards
>
> Richard
>
> -------------------------------
> Richard Dybowski PhD
> Research Fellow (Knowledge & Data Engineering)
> King's College London
> Medical Informatics Laboratory (Department of Medicine)
> 4th Floor
> North Wing
> St Thomas' Hospital
> Lambeth Palace Road
> London SE1 7EH
> UK
>
> Tel (office): (0)20 7928 9292 extension 6429
> Tel (mobile): 0976 250092
> Fax: +44 (0)20 7928 4458
> E-mail: richard.dybowski@kcl.ac.uk
> Web site: http://www.umds.ac.uk/microbio/richard/richard.htm
>
> {Note: Currently using e-mail address richard@n-space.co.uk
> whilst link to Internet is being established in my new office
> at St Thomas' Hospital}






Copyright © 1999 Nautilus Systems, Inc. All Rights Reserved.