
DM: Re: problem of sample size


From: I Kopanas
Date: Tue, 29 Aug 2000 13:30:54 +0300


 > Though it is a somewhat traditional question, I would like to know your
 > method for dealing with this kind of problem.
 >
 > The population size is about 9,500,000 records. There are two groups,
 > A and B. Unfortunately, the size of A is 9,300,000 and that of B is
 > 200,000.
 > Of course, the size of B is large enough to sample from or analyze, but
 > we have to balance the sizes of the two groups. What is an appropriate
 > sample size for the two groups, and what kind of sampling methods could
 > be applied?
 >
 > This problem is similar to the case of 1 bad guy and 99 good guys out
 > of 100 guys.

I had quite a similar problem:

My problem has to do with the data set. I have two classes (the good guys
and the bad guys); unfortunately, the bad guys are only 20 while the good
guys are 99,980. Does anybody know how to deal with this?
Thanks in advance.
      Yannis

Here are the answers I received:

1st

--------------------------------------------------------------------------------

Balanced resampling...

Build MANY (100s of) data sets, each with the 20 bad guys and 20 randomly
selected good guys. Make a model on each data set. When you have a new
sample to classify, have all your models 'vote' on good vs. bad and declare
as the winner the decision with the most votes.

-- Aaron --

aaron.j.owens@usa.dupont.com
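A minimal sketch of this balanced-resampling scheme, assuming NumPy arrays
X (features) and y (labels, 0 = good, 1 = bad); the choice of scikit-learn's
DecisionTreeClassifier and of 200 resampled sets is purely illustrative and
is not specified in the reply:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def balanced_resampling_ensemble(X, y, n_models=200, seed=0):
        """Fit one model per balanced data set: all bad guys plus an
        equal-sized random draw of good guys."""
        rng = np.random.default_rng(seed)
        bad_idx = np.flatnonzero(y == 1)      # the 20 bad guys
        good_idx = np.flatnonzero(y == 0)     # the 99,980 good guys
        models = []
        for _ in range(n_models):
            sampled_good = rng.choice(good_idx, size=bad_idx.size, replace=False)
            idx = np.concatenate([bad_idx, sampled_good])
            models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        return models

    def vote(models, X_new):
        """Majority vote of all models: 1 = bad, 0 = good."""
        votes = np.array([m.predict(X_new) for m in models])  # (n_models, n_samples)
        return (votes.mean(axis=0) >= 0.5).astype(int)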




2nd

--------------------------------------------------------------------------------

Your case is very extreme. Usually, I'd suggest playing with the prior
probabilities and misclassification costs. How important are those 20
"bad guys"?



--
T.S. Lim
tslim@recursive-partitioning.com
www.Recursive-Partitioning.com
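For concreteness, a rough sketch of what adjusting priors or
misclassification costs can look like; scikit-learn's class_weight
parameter and the 5000:1 ratio are assumptions chosen only to illustrate
the idea, not values taken from the reply:

    from sklearn.tree import DecisionTreeClassifier

    # Penalize mistakes on the rare class (label 1, the bad guys) far more
    # heavily than mistakes on the majority class; this plays the same role
    # as shifting the prior probabilities toward the rare class.
    model = DecisionTreeClassifier(class_weight={0: 1, 1: 5000}, random_state=0)
    # model.fit(X, y)   # X, y as in the earlier sketch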


3rd

--------------------------------------------------------------------------------

Also, if your learner doesn't allow you to set prior probabilities or
misclassification costs, you might try adding 50 copies of each bad guy
to your training sample.  I wouldn't remove good guys from your sample,
because your sample isn't insanely large (and I believe this practice
encourages overfitting).
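A minimal sketch of this duplication idea, assuming NumPy arrays X and y
as in the earlier sketches; the helper name is only for illustration:

    import numpy as np

    def oversample_by_duplication(X, y, minority_label=1, copies=50):
        """Append `copies` extra copies of every minority-class row to X and y."""
        minority = np.flatnonzero(y == minority_label)
        extra = np.repeat(minority, copies)   # each bad-guy index, 50 more times
        return np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])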

Basically, you want to tell the learner that classifying the bad guys is
important.

Lastly, accuracy isn't an appropriate metric in this domain.  By saying
everyone is a good guy, you get high accuracy but no insight into catching
bad guys.  Consider using precision and recall as your metrics for measuring
the effectiveness of your rules. Informally, if some rule identifies
X members as bad and Y of them were actually bad, the rule's precision is Y/X.
And if your sample has Z bad guys, that same rule's recall is Y/Z.

I hope this helps.

Earl Harris Jr.
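The precision and recall definitions above, written out as a small helper;
the function and variable names are illustrative, with x, y, z matching the
letters X, Y, Z used in the reply:

    def precision_recall(predicted_bad, actual_bad):
        """predicted_bad, actual_bad: collections of case identifiers."""
        predicted_bad, actual_bad = set(predicted_bad), set(actual_bad)
        x = len(predicted_bad)               # cases the rule flags as bad
        y = len(predicted_bad & actual_bad)  # flagged cases that really are bad
        z = len(actual_bad)                  # all true bad guys
        return (y / x if x else 0.0), (y / z if z else 0.0)

    # Example: a rule flags 10 cases and 4 of them are among the 20 true
    # bad guys: precision = 4/10 = 0.4, recall = 4/20 = 0.2.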

--------------------------------------------------------------------------------



