You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Ankur (JIRA)" <ji...@apache.org> on 2008/04/02 14:35:24 UTC
[jira] Commented: (MAHOUT-4) Simple prototype for Expectation
Maximization (EM)
[ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584544#action_12584544 ]
Ankur commented on MAHOUT-4:
----------------------------
Hi,
So after scratching my head for past couple of days and
understanding EM from a general perspective I built a mental model
for EM in the context of clustering.
I thought its better to share my understanding for getting inputs before
turning out the code.
So here is a short write-up in my words, please feel free to
fill any gaps/errors found
Expectation Maximization for clustering
-----------------------------------------------------
Let
z = unobserved data, clusters in our case.
y = observed data, points in our case.
We start by initializing the model paramters p(y|z) and p(z) to appropriately
normalized random values.
For eg:- let number of points = 4 and number of clusters = 2
then for cluster z1
p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1
p(z1) + p(z2) = 1
E-Step.
------
Calculate posteriori estimates of probabilities for various
values of the unknown z as follows:-
p(y,z) p(y|z)*p(z)
p(z|y) = --------- = -----------------------------------------
p(y) summation over z { p(y|z)*p(z) }
Calculate expected value of log likelihood as follows:-
Q(y) = summation over z { p(z|y) * log(p(y,z))}
Q(y) remains unchanged if the calue calculated is = previous Q(y) value.
The algorithm terminates when no improvement is seen for Q(y) for all y
Use Q(y) as the value for p(y,z) for M-Step to re-compute model parameters
p(y|z) and p(z)
M-Step
------
p(y,z) (from E-Step)
p(y|z) = --------------------------------------------------
summation over y { p(y,z) } (from E-Step)
p(z) = summation over y { p(y,z) } (from E-Step)
Questions
=========
1. When and how do we re-compute the cluster centers ?
2. As per my understanding points and clusters are simply labels with some
conditional probability assigned to them. A distance metric like one
used in K-means is nowhere involved, is that correct ?
> Simple prototype for Expectation Maximization (EM)
> --------------------------------------------------
>
> Key: MAHOUT-4
> URL: https://issues.apache.org/jira/browse/MAHOUT-4
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ankur
> Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization - EM that demonstrates the algorithm functionality given a set of (user, click-url) data.
> The prototype should be functionally complete and should serve as a basis for the Map-Reduce version of the EM algorithm.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)
Posted by Isabel Drost <ap...@isabel-drost.de>.
Sorry for the late reply, real life is trying to catch me ;)
On Wednesday 02 April 2008, Ankur (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plug
>in.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584544#action_1
>2584544 ]
>
> Ankur commented on MAHOUT-4:
> ----------------------------
> So here is a short write-up in my words, please feel free to
> fill any gaps/errors found
I will try to do so from my perspective, maybe others can add their views.
> Expectation Maximization for clustering
> -----------------------------------------------------
> Let
> z = unobserved data, clusters in our case.
> y = observed data, points in our case.
>
> p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1
> p(z1) + p(z2) = 1
Looks correct to me.
> E-Step.
> ------
> M-Step
> ------
I could not find an error in neither of the two steps so far.
> Questions
> =========
> 1. When and how do we re-compute the cluster centers ?
EM does not work with explicit cluster centers. In kmeans you iterate two
steps: Assigning points to centers and recomputing the centers. In EM you
again iterate two steps: Computing the probabilities for each point belonging
to the clusters (so you do not assign them hard to one cluster, you only say
with probability P it belongs to clusters i to k), in the second step you
recompute the parameters of each cluster - the cluster center is influenced
by each point but only weighted by its probability of belonging to this
cluster.
> 2. As per my understanding points and clusters are simply labels with some
> conditional probability assigned to them. A distance metric like one
> used in K-means is nowhere involved, is that correct ?
Yes and no: Technically no, conceptually, your computation for the probability
of assigning a point to a cluster should be based on the point's distance to
the cluster.
I hope I did not cause more confusion than helping you. Maybe others can
correct me or clarify what I left unclear...
Isabel
--
There is no sin but ignorance. -- Christopher Marlowe
|\ _,,,---,,_ Web: <http://www.isabel-drost.de>
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) (fL) IM: <xm...@spaceboyz.net>