Re: DM: Imputation of binary-valued features
From: Murray Jorgensen
Date: Thu, 14 Aug 1997 18:20:38 -0400 (EDT)

At 21:22 14/08/97 +0100, Richard Dybowski <richard@n-space.co.uk> wrote:
>Hi
>
>I have a dataset in which all the variables (features) are binary, however,
>some of the rows of the dataset have at least one value missing. Can anyone
>give me details of an E-M algorithm (for which convergence is guaranteed)
>that will enable me to model the underlying probability mass distribution
>thus enabling me to perform imputation? There is an established method of
>doing this when the variables are real-valued (i.e. by using a Gaussian
>mixture model of a multivariate pdf), but what is the approved method when
>the variables are binary-valued (or a mixture of real- and binary-valued
>variables)?
>
>Thanking you in advance,
>
>Richard

I posted the following notice on Class-l in March. Unfortunately we still
haven't got it up on our ftp site owing to other commitments, but, as they
say, real-soon-now!

To answer Richard's question: for binary or multi-category variables the
standard approach is known as Latent Class Analysis, and for data in which
the variables are both continuous and categorical the answer is our MULTIMIX.

Our earlier announcement follows:

-------------------------------------------------------------------------

The MULTIMIX group at the University of Waikato (Lynette Hunt and Murray
Jorgensen) announce the availability of the MULTIMIX program, which clusters
data having both categorical and continuous variables, possibly containing
missing observations. The class of models fitted is described in the (Plain)
TeX code which follows and generalizes both Latent Class Analysis and
Mixtures of Multivariate Normals.

We hope soon to have this software available on our ftp site. If you are
interested in downloading this software please send us your email address
and we will notify you when the program becomes available.

Lynette Hunt and Murray Jorgensen (the MULTIMIX group)

\def\xbold{{\bf x}}
\font\bm =cmmib10 % bold maths
\def\thetavec{\hbox{\bm\char'022}}
\def\muvec{\hbox{\bm\char'026}}
\def\nuvec{\hbox{\bm\char'027}}
\def\phivec{\hbox{\bm\char'036}}
\def\px{\mathord{\buildrel{\lower3pt
\hbox{$\scriptscriptstyle\smile$}} \over {\bf x}}}

{\bf Models available in MULTIMIX}

We expect the data to be in the form of an $n\times p$ matrix of
observations by variables which we regard as a random sample from the
distribution $f(x)=\sum {\pi _kf_k(x)}$, itself a finite mixture of the $K$
component distributions $f_k$ in proportions $\pi_k\ge 0$ satisfying
$\sum \pi_k=1$.

We suppose that the vector of variables
$\xbold=(x_1,\ldots,x_j,\ldots,x_p)^\prime$ has been partitioned into
$(\px_1^\prime\quad |\ldots|\quad\px_l^\prime\quad |
\ldots|\quad\px_L^\prime)^\prime$ and we consider component distributions
of the form $f_k(\xbold)=\prod_l f_{kl}(\px_l)$. This is a weak form of
`local independence': within each of the $K$ subpopulations the variables in
the subvector $\px_l$ are independent of the variables in $\px_{l^\prime}$
for $1\le l<l^\prime\le L$. True `local independence' is the independence of
each $x_j$ within subpopulations. We can write the model for the $i$th
observation as
$$f(\xbold_i;\phivec)=\sum\limits_{k=1}^K\pi_k
\prod\limits_{l=1}^Lf_{kl}(\px_{il};\thetavec_{kl})$$
where $\thetavec_{kl}$ consists of the parameters of the distribution
$f_{kl}$. This formulation includes the motivating examples of Latent Class
analysis (Aitkin, Anderson \& Hinde, 1981) and mixtures of multivariate
normals (McLachlan \& Basford, 1988). When a subvector contains only a
single variable, that variable is independent of all other variables within
each subpopulation.

It is convenient to assume forms for the $f_{kl}$, and hence for the
$f_{k}$, that belong to the exponential family.
The model is then well suited for maximum likelihood estimation of its
parameters by the {\it EM} algorithm of Dempster, Laird \& Rubin (1977).
This approach is followed in MULTIMIX with the following distributions for
the $\px_{kl}$:\hfil\break
(a) {\it Discrete Distribution.} Here $\px_l=\{x_j\}$ is a 1-dimensional
discrete random variable taking values $1,\ldots,M_j$ with probabilities
$\lambda_{kl1},\ldots,\lambda_{klM_j}$.\hfil\break
(b) {\it Multivariate Normal.} Here $\px_l$ is a $p_l$-dimensional vector of
continuous random variables with the $N_{p_l}(\muvec_{kl}, \Sigma_{kl})$
distribution.\hfil\break
(c) {\it Location Model.} Here $\px_l$ is a $1+p_l$ dimensional vector of
random variables with one discrete variable, $x_j$, and $p_l$ continuous
variables as elements. The discrete random variable takes values
$1,\ldots,M_j$ with probabilities $\lambda_{kl1},\ldots,\lambda_{klM_j}$.
Conditional on the discrete variable taking value $m$ the $p_l$ continuous
random variables have the multivariate normal distribution
$N_{p_l}(\nuvec_{mkl},\Xi_{kl})$.

\bye

Dr Murray Jorgensen       maj@waikato.ac.nz      Phone +64-7 838 4773
Department of Statistics  home phone 856 6705;   Fax 838 4666
University of Waikato     http://www.cs.waikato.ac.nz/stats/Staff/maj.html
Hamilton, New Zealand     **** Editor: New Zealand Statistician ****
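
For the all-binary case raised in the original question, the model above
reduces to a latent class model in which every subvector holds a single
binary variable. What follows is a minimal Python/NumPy sketch of EM for
that special case only; it is not the MULTIMIX program. It assumes that
missing entries are coded as NaN and are missing at random, so that only
the observed cells of each row enter the likelihood. The names
fit_latent_class, impute_probabilities and EPS are illustrative choices,
not anything taken from MULTIMIX.

```python
import numpy as np

EPS = 1e-6  # assumed guard: keeps probabilities off 0 and 1 so logs stay finite


def _e_step(X0, obs, pi, theta):
    """Return responsibilities (n x K) and the observed-data log-likelihood."""
    # X0 and obs are both 0 where a value is missing, so a missing cell
    # adds nothing to either term of the Bernoulli log-likelihood.
    log_joint = (np.log(pi)
                 + X0 @ np.log(theta).T
                 + (obs - X0) @ np.log1p(-theta).T)
    # Log-sum-exp over classes, done by hand to avoid extra dependencies.
    top = log_joint.max(axis=1, keepdims=True)
    log_norm = top + np.log(np.exp(log_joint - top).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_norm), float(log_norm.sum())


def fit_latent_class(X, n_classes, n_iter=500, tol=1e-8, seed=0):
    """EM for a mixture of independent Bernoullis; NaN marks a missing entry."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    obs = (~missing).astype(float)
    X0 = np.where(missing, 0.0, X)
    rng = np.random.default_rng(seed)
    pi = np.full(n_classes, 1.0 / n_classes)
    theta = rng.uniform(0.25, 0.75, size=(n_classes, X.shape[1]))
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities given the observed cells.
        resp, loglik = _e_step(X0, obs, pi, theta)
        trace.append(loglik)
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: weighted proportions, using only observed cells per variable.
        pi = np.clip(resp.mean(axis=0), EPS, None)
        pi /= pi.sum()
        counts = resp.T @ obs
        theta = np.clip((resp.T @ X0) / np.maximum(counts, EPS), EPS, 1 - EPS)
    return pi, theta, trace


def impute_probabilities(X, pi, theta):
    """Replace each NaN by the fitted P(x_ij = 1 | observed part of row i)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    resp, _ = _e_step(np.where(missing, 0.0, X), (~missing).astype(float),
                      pi, theta)
    return np.where(missing, resp @ theta, X)
```

A call such as pi, theta, trace = fit_latent_class(X, n_classes=2) followed
by impute_probabilities(X, pi, theta) returns the data with each missing cell
replaced by a probability, which can be thresholded or sampled from. Apart
from the effect of the EPS clipping, the values in trace should never
decrease, which is the sense in which EM convergence is guaranteed; the
limit may still be a local rather than a global maximum, so several values
of seed are worth trying.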