Re: DM: Imputation of binary-valued features
From: Murray Jorgensen
Date: Thu, 14 Aug 1997 18:20:38 -0400 (EDT)
At 21:22 14/08/97 +0100, Richard Dybowski <richard@n-space.co.uk> wrote:
>Hi
>
>I have a dataset in which all the variables (features) are binary; however,
>some of the rows of the dataset have at least one value missing. Can anyone
>give me details of an EM algorithm (for which convergence is guaranteed)
>that will enable me to model the underlying probability mass distribution,
>thus enabling me to perform imputation? There is an established method of
>doing this when the variables are real-valued (i.e. by using a Gaussian
>mixture model of a multivariate pdf), but what is the approved method when
>the variables are binary-valued (or a mixture of real- and binary-valued
>variables)?
>
>Thanking you in advance,
>
>Richard
I posted the following notice on Class-l in March. Unfortunately we still
haven't got it up on our ftp site owing to other commitments, but, as they
say, real-soon-now!
To answer Richard's question: for binary or multi-category variables the
method is known as Latent Class Analysis, and for data with both continuous
and categorical variables it is our MULTIMIX.
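
Since Richard asked specifically for an EM algorithm with guaranteed
convergence, here is a minimal sketch of EM for a latent class model on
binary data with missing entries, written in Python/NumPy purely as an
illustration of the approach. It is not the MULTIMIX code, the missing
values are assumed missing at random, and all names (fit_latent_class,
n_iter, and so on) are my own choices.

import numpy as np

def fit_latent_class(X, K, n_iter=200, seed=0):
    # X: (n, p) float array of 0/1 values with np.nan marking missing entries.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = ~np.isnan(X)                        # True where a value was observed
    obsf = obs.astype(float)
    Xf = np.where(obs, X, 0.0)                # missing entries zeroed; masked out below
    pi = np.full(K, 1.0 / K)                  # mixing proportions pi_k
    theta = rng.uniform(0.25, 0.75, (K, p))   # theta[k, j] = P(x_j = 1 | class k)

    for _ in range(n_iter):
        # E-step: responsibilities computed from the observed entries of each row only
        log_r = np.tile(np.log(pi), (n, 1))
        for k in range(K):
            term = Xf * np.log(theta[k]) + (1 - Xf) * np.log(1 - theta[k])
            log_r[:, k] += (term * obsf).sum(axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)   # guard against underflow
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form weighted updates
        pi = r.mean(axis=0)
        theta = (r.T @ Xf) / (r.T @ obsf)
        theta = np.clip(theta, 1e-6, 1.0 - 1e-6)

    # Imputation: P(x_ij = 1 | observed part of x_i) = sum_k r_ik * theta_kj
    X_imputed = np.where(obs, Xf, r @ theta)
    return pi, theta, X_imputed

Each M-step maximizes the expected complete-data log-likelihood in closed
form, so the observed-data likelihood never decreases; this is the usual EM
convergence guarantee. A call such as
pi_hat, theta_hat, X_imp = fit_latent_class(X, K=3) returns imputed values
that are posterior probabilities of a 1, which can be thresholded if hard
0/1 imputations are wanted.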
Our earlier announcement follows:
-------------------------------------------------------------------------
The MULTIMIX group at the University of Waikato
(Lynette Hunt and Murray Jorgensen) announce the availability of the
MULTIMIX program, which clusters data having both categorical and
continuous variables, possibly containing missing observations.
The class of models fitted, described in the (Plain) TeX code which
follows, generalizes both Latent Class Analysis and Mixtures of
Multivariate Normals.
We hope soon to have this software available on our ftp site. If you are
interested in downloading this software, please send us your email address
and we will notify you when the program is available.
Lynette Hunt and Murray Jorgensen
(the MULTIMIX group)
\def\xbold{{\bf x}}
\font\bm =cmmib10 % bold maths
\def\thetavec{\hbox{\bm\char'022}}
\def\muvec{\hbox{\bm\char'026}}
\def\nuvec{\hbox{\bm\char'027}}
\def\phivec{\hbox{\bm\char'036}}
\def\px{\mathord{\buildrel{\lower3pt
\hbox{$\scriptscriptstyle\smile$}} \over {\bf x}}}
{\bf Models available in MULTIMIX}
We expect the data to be in the form of an $n\times p$
matrix of observations by variables which we regard
as a random sample from the distribution $f(x)=\sum
{\pi _kf_k(x)}$, itself a finite mixture of the $K$
component distributions $f_k$ in proportions
$\pi_k\ge 0$ satisfying $\sum \pi_k=1$. We suppose
that the vector of variables
$\xbold=(x_1,\ldots,x_j,\ldots,x_p)^\prime$ has been
partitioned into $(\px_1^\prime\quad
|\ldots|\quad\px_l^\prime\quad |
\ldots|\quad\px_L^\prime)^\prime$ and we consider
component distributions of the form
$f_k(\xbold)=\prod_l f_{kl}(\px_l)$. This is a weak
form of `local independence': within each of the $K$
subpopulations the variables in the subvector $\px_l$
are independent of the variables in $\px_{l^\prime}$
for $1\le l<l^\prime\le L$. True `local independence' is
the independence of each $x_j$ within subpopulations.
We can write the model for the $i$th observation as
$$f(\xbold_i;\phivec)=\sum\limits_{k=1}^K\pi_k
\prod\limits_{l=1}^Lf_{kl}(\px_{il};\thetavec_{kl})$$
where $\thetavec_{kl}$ consists of the parameters of the
distribution $f_{kl}$. This formulation includes the
motivating examples of Latent Class analysis (Aitkin,
Anderson \& Hinde, 1981) and mixtures of multivariate
normals (McLachlan \& Basford, 1988). When a subvector
contains only a single variable, that variable is
independent of all other variables within each
subpopulation. It is convenient to assume forms
for the $f_{kl}$, and hence for the $f_{k}$, that
belong to the exponential family. The model is then
well suited for maximum likelihood estimation of its
parameters by the {\it EM} algorithm of Dempster,
Laird \& Rubin (1977). This approach is followed in
MULTIMIX with the following distributions for the
$\px_l$:\hfil\break
(a) {\it Discrete Distribution.} Here $\px_l=\{x_j\}$
is a 1-dimensional discrete random variable taking
values $1,\ldots,M_j$ with probabilities
$\lambda_{kl1},\ldots,\lambda_{klM_j}$.\hfil\break
(b) {\it Multivariate Normal.} Here $\px_l$ is a
$p_l$-dimensional vector of continuous random
variables with the $N_{p_l}(\muvec_{kl},
\Sigma_{kl})$ distribution.\hfil\break
(c) {\it Location Model.} Here $\px_l$ is a $(1+p_l)$-dimensional
vector of random variables with one
discrete variable, $x_j$, and $p_l$ continuous variables
as elements. The discrete random variable takes
values $1,\ldots,M_j$ with probabilities
$\lambda_{kl1},\ldots,\lambda_{klM_j}$. Conditional on
the discrete variable taking value $m$ the $p_l$
continuous random variables have the multivariate
normal distribution $N_{p_l}(\nuvec_{mkl},\Xi_{kl})$.
\bye
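
To make the block structure above concrete, here is a small illustrative
sketch (again not the MULTIMIX program) of the E-step responsibility
calculation when the partition has one discrete block, case (a), and one
multivariate normal block, case (b). The two-block layout, the use of
scipy, and all names are assumptions made for the example.

import numpy as np
from scipy.stats import multivariate_normal

def log_component_density(x_disc, x_cont, lam_k, mu_k, Sigma_k):
    # log f_k(x) = log f_{k,discrete}(x_disc) + log f_{k,normal}(x_cont):
    # the two blocks are independent within class k (the weak local independence above).
    log_disc = np.log(lam_k[x_disc])      # lambda_{k,m}; categories coded 0..M-1 here
    log_cont = multivariate_normal.logpdf(x_cont, mean=mu_k, cov=Sigma_k)
    return log_disc + log_cont

def responsibilities(x_disc, x_cont, pi, lam, mu, Sigma):
    # E-step posterior class probabilities for a single observation.
    K = len(pi)
    log_post = np.array([np.log(pi[k]) +
                         log_component_density(x_disc, x_cont, lam[k], mu[k], Sigma[k])
                         for k in range(K)])
    log_post -= log_post.max()            # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

The corresponding M-step updates pi_k, lambda_{km}, mu_k and Sigma_k as
responsibility-weighted proportions, means and covariances; the location
model of case (c) conditions the normal mean on the discrete category in
the same way.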
Dr Murray Jorgensen          maj@waikato.ac.nz       Phone +64-7 838 4773
Department of Statistics     home phone 856 6705; Fax 838 4666
University of Waikato        http://www.cs.waikato.ac.nz/stats/Staff/maj.html
Hamilton, New Zealand        **** Editor: New Zealand Statistician ****