
DM: RE: Problem of Sample size


From: osborn
Date: Thu, 6 Jul 2000 11:15:21 +1000

eeek@okclub.com WHOEVER he/she is, wrote:

 > what is proper sample size?
 > How could we say that this group is too small, and the other is large?

 > If there is proper size of sample for analysis, and the sample that
 > we have is too small, please, let me know best way of increasing
 > the sample size.

It's hard to tell whether this is a troll, a joke, or a serious question. It
would be appropriate for all posters to identify themselves by name and
affiliation in future...

But to take this one seriously for a minute:

(1) Experimental design is the collection of methods behind sample
size and structure (as well as what to measure/query, consistency
checking, etc.). It is a technical field - i.e., you have to study it.

(2) It is assumption-driven, but some assumptions can be refined by
pilot studies, reviewing prior studies, introducing domain-specific
constraints, etc. (a rough sketch of this appears after point (4)).

(3) It also depends on what you want to determine from your analysis
and modelling, and how precisely you want to know it (mainly sensitivity
and specificity). This has to be informed by the costs/utility of Type I
and Type II errors, and by the risk of prediction errors (again, see the
sketch after point (4)). If other insights are desired (e.g., structure in
data sets), there are further issues to do with dimensionality and
discriminating power (another large topic).

(4) For conventional hypothesis testing, or modelling using GLMs or
OLS, experimental design is fairly straightforward (as the models and
assumptions extend over the population space).
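
To tie (2) and (3) together, here is a back-of-the-envelope sketch in
Python (numpy + statsmodels). It estimates an effect size from a small
pilot sample, then asks how many observations per group a simple
two-group comparison needs for chosen Type I and Type II error rates.
The pilot numbers and the 0.05/0.80 settings are illustrative
assumptions, not recommendations.

import numpy as np
from statsmodels.stats.power import TTestIndPower

def cohens_d(a, b):
    """Pooled-SD standardised mean difference from two pilot samples."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) +
                  (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# (2) refine the effect-size assumption from a small (made-up) pilot study
treated = [5.1, 4.6, 5.9, 4.8, 5.5, 6.2, 5.0, 5.7]
control = [4.9, 5.3, 4.4, 5.1, 4.6, 5.8, 4.7, 5.2]
d = cohens_d(treated, control)

# (3) translate the tolerated Type I rate (alpha) and Type II rate
# (1 - power) into a per-group sample size for a two-sample t-test
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05,
                                          power=0.80,
                                          alternative='two-sided')
print(f"pilot effect size d = {d:.2f}, about {n_per_group:.0f} per group")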

Using other kinds of modelling (non-parametric, neural, hierarchical,
Bayes variants, etc.), the issues get clouded because model components
are dependent on each other. In my experience, if you use the underlying
design for GLMs, you will do better (more information extracted, more
precision, less risk, etc.) using the appropriate, more elaborate method.
This assumes you have some expertise with the more elaborate method.
[This assumption may not be valid if the modelling is done by a
"business analyst", a "cowboy", or someone on too many drugs.]

Where modelling is done on fairly small data sets or with high-parameter
models, model over-fitting is a big concern, with a large literature
advising on options. Read the Neural Net FAQ (for starters).
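
A quick way to see the problem (again a sketch, with made-up data): fit a
flexible, effectively high-parameter model on a small sample and compare
its training fit with a cross-validated estimate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))          # small n, relatively many inputs
y = X[:, 0] + rng.normal(size=40)      # only the first input matters

tree = DecisionTreeRegressor()         # grows until it memorises the sample
tree.fit(X, y)
print("training R^2:", round(tree.score(X, y), 2))                            # 1.0
print("5-fold CV R^2:", round(cross_val_score(tree, X, y, cv=5).mean(), 2))   # far lower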

 > I consider two possibilities. One is data duplication and the other is
 > inclusion of old data having a slightly different pattern than I think.

I respectfully suggest that it's time to read a book.

Tom.

Dr Tom Osborn
Director of Modelling
The NTF Group
Decision Support Consultants
Level 7, 1 York Street
SYDNEY NSW 2000
AUSTRALIA
phone:	+61 2 9252 0600
fax:	+61 2 9251 9894



