Re: DM: Imputation of binary-valued features
From: Murray Jorgensen
Date: Thu, 14 Aug 1997 18:20:38 -0400 (EDT)
At 21:22 14/08/97 +0100, Richard Dybowski <richard@n-space.co.uk> wrote:
>Hi
>
>I have a dataset in which all the variables (features) are binary; however,
>some of the rows of the dataset have at least one value missing. Can anyone
>give me details of an EM algorithm (for which convergence is guaranteed)
>that will enable me to model the underlying probability mass distribution,
>thus enabling me to perform imputation? There is an established method of
>doing this when the variables are real-valued (i.e. by using a Gaussian
>mixture model of a multivariate pdf), but what is the approved method when
>the variables are binary-valued (or a mixture of real- and binary-valued
>variables)?
>
>Thanking you in advance,
>
>Richard
I posted the following notice on Class-l in March. Unfortunately we still
haven't got it up on our ftp site owing to other commitments, but, as they
say, real-soon-now!
To answer Richard's question: for binary or multi-category variables the
method is known as Latent Class Analysis, and for data with both continuous
and categorical variables it is our MULTIMIX.
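
Since Richard asked specifically for an EM algorithm with guaranteed
convergence, here is a minimal sketch of EM for a latent class model on
binary data with missing entries, written in Python/NumPy purely as an
illustration of the approach. It is not the MULTIMIX code, the missing
values are assumed missing at random, and all names (fit_latent_class,
n_iter, and so on) are my own choices.

import numpy as np

def fit_latent_class(X, K, n_iter=200, seed=0):
    # X: (n, p) float array of 0/1 values with np.nan marking missing entries.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = ~np.isnan(X)                        # True where a value was observed
    obsf = obs.astype(float)
    Xf = np.where(obs, X, 0.0)                # missing entries zeroed; masked out below
    pi = np.full(K, 1.0 / K)                  # mixing proportions pi_k
    theta = rng.uniform(0.25, 0.75, (K, p))   # theta[k, j] = P(x_j = 1 | class k)

    for _ in range(n_iter):
        # E-step: responsibilities computed from the observed entries of each row only
        log_r = np.tile(np.log(pi), (n, 1))
        for k in range(K):
            term = Xf * np.log(theta[k]) + (1 - Xf) * np.log(1 - theta[k])
            log_r[:, k] += (term * obsf).sum(axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)   # guard against underflow
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form weighted updates
        pi = r.mean(axis=0)
        theta = (r.T @ Xf) / (r.T @ obsf)
        theta = np.clip(theta, 1e-6, 1.0 - 1e-6)

    # Imputation: P(x_ij = 1 | observed part of x_i) = sum_k r_ik * theta_kj
    X_imputed = np.where(obs, Xf, r @ theta)
    return pi, theta, X_imputed

Each M-step maximizes the expected complete-data log-likelihood in closed
form, so the observed-data likelihood never decreases; this is the usual EM
convergence guarantee. A call such as
pi_hat, theta_hat, X_imp = fit_latent_class(X, K=3) returns imputed values
that are posterior probabilities of a 1, which can be thresholded if hard
0/1 imputations are wanted.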
Our earlier announcement follows:
-------------------------------------------------------------------------
The MULTIMIX group at the University of Waikato
(Lynette Hunt and Murray Jorgensen) announce the availability of the
MULTIMIX program, which clusters data having both categorical and
continuous variables, possibly containing missing observations.
The class of models fitted, described in the (Plain) TeX code which
follows, generalizes both Latent Class Analysis and Mixtures of
Multivariate Normals.
We hope soon to have this software available on our ftp site. If you are
interested in downloading this software, please send us your email address
and we will notify you when the program is available.
Lynette Hunt and Murray Jorgensen
(the MULTIMIX group)
\def\xbold{{\bf x}}
\font\bm =cmmib10 % bold maths
\def\thetavec{\hbox{\bm\char'022}}
\def\muvec{\hbox{\bm\char'026}}
\def\nuvec{\hbox{\bm\char'027}}
\def\phivec{\hbox{\bm\char'036}}
\def\px{\mathord{\buildrel{\lower3pt
\hbox{$\scriptscriptstyle\smile$}} \over {\bf x}}}
{\bf Models available in MULTIMIX}
We expect the data to be in the form of an $n\times p$
matrix of observations by variables which we regard
as a random sample from the distribution $f(x)=\sum
{\pi _kf_k(x)}$, itself a finite mixture of the $K$
component distributions $f_k$ in proportions
$\pi_k\ge 0$ satisfying $\sum \pi_k=1$. We suppose
that the vector of variables
$\xbold=(x_1,\ldots,x_j,\ldots,x_p)^\prime$ has been
partitioned into $(\px_1^\prime\quad
|\ldots|\quad\px_l^\prime\quad |
\ldots|\quad\px_L^\prime)^\prime$ and we consider
component distributions of the form
$f_k(\xbold)=\prod_l f_{kl}(\px_l)$. This is a weak
form of `local independence': within each of the $K$
subpopulations the variables in the subvector $\px_l$
are independent of the variables in $\px_{l^\prime}$
for $1\le l<l^\prime\le L$. True `local independence' is
the independence of each $x_j$ within subpopulations.
We can write the model for the $i$th observation as
$$f(\xbold_i;\phivec)=\sum\limits_{k=1}^K\pi_k
\prod\limits_{l=1}^Lf_{kl}(\px_{il};\thetavec_{kl})$$
where $\thetavec_{kl}$ consists of the parameters of the
distribution $f_{kl}$. This formulation includes the
motivating examples of Latent Class analysis (Aitkin,
Anderson \& Hinde, 1981) and mixtures of multivariate
normals (McLachlan \& Basford, 1988). When a subvector
contains only a single variable, that variable is
independent of all other variables within each
subpopulation. It is convenient to assume forms
for the $f_{kl}$, and hence for the $f_{k}$, that
belong to the exponential family. The model is then
well suited for maximum likelihood estimation of its
parameters by the {\it EM} algorithm of Dempster,
Laird \& Rubin (1977). This approach is followed in
MULTIMIX with the following distributions for the
$\px_l$:\hfil\break
(a) {\it Discrete Distribution.} Here $\px_l=\{x_j\}$
is a 1-dimensional discrete random variable taking
values $1,\ldots,M_j$ with probabilities
$\lambda_{kl1},\ldots,\lambda_{klM_j}$.\hfil\break
(b) {\it Multivariate Normal.} Here $\px_l$ is a
$p_l$-dimensional vector of continuous random
variables with the $N_{p_l}(\muvec_{kl},
\Sigma_{kl})$ distribution.\hfil\break
(c) {\it Location Model.} Here $\px_l$ is a $(1+p_l)$-dimensional
vector of random variables with one
discrete variable, $x_j$, and $p_l$ continuous variables
as elements. The discrete random variable takes
values $1,\ldots,M_j$ with probabilities
$\lambda_{kl1},\ldots,\lambda_{klM_j}$. Conditional on
the discrete variable taking value $m$ the $p_l$
continuous random variables have the multivariate
normal distribution $N_{p_l}(\nuvec_{mkl},\Xi_{kl})$.
\bye
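
To make the block structure above concrete, here is a small illustrative
sketch (again not the MULTIMIX program) of the E-step responsibility
calculation when the partition has one discrete block, case (a), and one
multivariate normal block, case (b). The two-block layout, the use of
scipy, and all names are assumptions made for the example.

import numpy as np
from scipy.stats import multivariate_normal

def log_component_density(x_disc, x_cont, lam_k, mu_k, Sigma_k):
    # log f_k(x) = log f_{k,discrete}(x_disc) + log f_{k,normal}(x_cont):
    # the two blocks are independent within class k (the weak local independence above).
    log_disc = np.log(lam_k[x_disc])      # lambda_{k,m}; categories coded 0..M-1 here
    log_cont = multivariate_normal.logpdf(x_cont, mean=mu_k, cov=Sigma_k)
    return log_disc + log_cont

def responsibilities(x_disc, x_cont, pi, lam, mu, Sigma):
    # E-step posterior class probabilities for a single observation.
    K = len(pi)
    log_post = np.array([np.log(pi[k]) +
                         log_component_density(x_disc, x_cont, lam[k], mu[k], Sigma[k])
                         for k in range(K)])
    log_post -= log_post.max()            # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

The corresponding M-step updates pi_k, lambda_{km}, mu_k and Sigma_k as
responsibility-weighted proportions, means and covariances; the location
model of case (c) conditions the normal mean on the discrete category in
the same way.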
Dr Murray Jorgensen          maj@waikato.ac.nz       Phone +64-7 838 4773
Department of Statistics     home phone 856 6705; Fax 838 4666
University of Waikato        http://www.cs.waikato.ac.nz/stats/Staff/maj.html
Hamilton, New Zealand        **** Editor: New Zealand Statistician ****