Posted to dev@spark.apache.org by Denis Turdakov <tu...@ispras.ru> on 2014/07/03 17:49:44 UTC

PLSA

Hello guys,

We made a pull request with PLSA and its modifications:
- https://github.com/apache/spark/pull/1269
- JIRA issue SPARK-2199
Could somebody take a look at the code and provide feedback on what we
should improve?

Best regards,
Denis Turdakov





Re: PLSA

Posted by Denis Turdakov <tu...@ispras.ru>.
Hi, Deb.

Thanks for the idea of using ALS for PLSA training. I discussed it with our
engineers, and it seems better to use EM, for the following reasons:

1. We have some doubts that ALS is applicable to the problem. By its
definition, PLSA is a matrix decomposition with respect to the
Kullback–Leibler divergence. Unlike decomposition with respect to the
Frobenius distance, this problem is not convex. In fact, PLSA has multiple
solutions, and the EM algorithm has been shown experimentally to find "good"
local maxima (the objective we mean is written out after this list).
Moreover, the Kullback–Leibler divergence is not symmetric, does not satisfy
the triangle inequality, and the objects it is defined on do not form a
linear space -- wouldn't that be a problem for ALS? For instance, doesn't
ALS rely on the fact that the L_2 norm is defined on objects that form a
self-dual space?

2. Training PLSA with the EM algorithm is the standard approach described in
the best-known papers. The "Contributing to Spark" page says a new algorithm
should be "widely known" and "used and accepted". We have found no
publications describing PLSA training via ALS, so we are not sure it would
give results of comparable quality to the EM algorithm. That would be a
research question in itself.

3. The PLSA objective function is non-differentiable at some points (e.g.
\phi_{wt} = 0 \forall t), and the EM algorithm is theoretically proven to
avoid such points. We are afraid ALS is likely to fall into this trap.

4. We also expect EM to be faster for this problem. PLSA has D*T + W*T
parameters (where D stands for the number of documents, T for the number of
topics, and W for the size of the vocabulary). Thus the Hessian matrix has
size (D*T + W*T)^2 -- that's a lot. Note that the EM algorithm has O(D*T*W)
complexity per iteration (see the sketch after this list). It may be
possible to use a diagonalized Hessian instead, but that would increase the
number of iterations needed for convergence, and we think the EM algorithm
would still outperform it because we have a very simple analytical solution
for the E-step. (There is a story from the 80s about EM algorithms and
second-order methods: second-order methods were believed to be superior
until it was shown that precise optimization during the E-step is not
necessary -- it is enough to merely increase the objective function. This
sped up the E-step, and the EM algorithm became superior to second-order
methods. Once again, our E-step has a very simple analytical solution -- it
is very fast.)

5. As far as we understand, RALS handles only quadratic regularizers
(perhaps one could substitute a quadratic approximation at every iteration,
but we see no reason why that would work). In PLSA we want to allow the user
to define any regularizer she wants.

6. There are only a few broadcasts in our implementation. Only the \phi
matrix is broadcast (yes, we have to call the broadcast(...) method in three
different places, but we broadcast only once per iteration).
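
For reference, here is the objective we mean in point 1, in the notation of
the paper linked later in this thread (\phi_{wt} = p(w|t), \theta_{td} =
p(t|d), n_{dw} = the count of word w in document d):

    L(\phi, \theta) = \sum_{d,w} n_{dw} \log \sum_t \phi_{wt} \theta_{td} \to \max,
    subject to \sum_w \phi_{wt} = 1, \sum_t \theta_{td} = 1,
               \phi_{wt} >= 0, \theta_{td} >= 0.

Maximizing L is equivalent to minimizing the KL divergence between the
empirical frequencies n_{dw}/n_d and the model p(w|d) = \sum_t \phi_{wt}
\theta_{td}, weighted by document lengths.

And here is a minimal, local (non-distributed) Scala sketch of one EM
iteration, only to illustrate the closed-form E-step and the O(D*T*W)
per-iteration cost from point 4. All names are illustrative; this is not the
API of the pull request.

    object PlsaEmSketch {
      // One EM iteration for PLSA on a dense D x W count matrix n.
      // phi(w)(t) plays the role of p(w|t), theta(t)(d) of p(t|d).
      def emIteration(n: Array[Array[Double]],
                      phi: Array[Array[Double]],
                      theta: Array[Array[Double]])
          : (Array[Array[Double]], Array[Array[Double]]) = {
        val (dNum, wNum, tNum) = (n.length, phi.length, phi(0).length)
        val phiNew = Array.fill(wNum, tNum)(0.0)
        val thetaNew = Array.fill(tNum, dNum)(0.0)
        for (d <- 0 until dNum; w <- 0 until wNum if n(d)(w) > 0) {
          // E-step: p(t|d,w) = phi(w)(t) * theta(t)(d) / Z -- closed form.
          val z = Array.tabulate(tNum)(t => phi(w)(t) * theta(t)(d))
          val s = z.sum
          if (s > 0) for (t <- 0 until tNum) {
            // Accumulate expected counts n(w,t) and n(t,d) for the M-step.
            val inc = n(d)(w) * z(t) / s
            phiNew(w)(t) += inc
            thetaNew(t)(d) += inc
          }
        }
        // M-step: renormalize so each topic and document is a distribution.
        for (t <- 0 until tNum) {
          val s = phiNew.map(_(t)).sum
          if (s > 0) for (w <- 0 until wNum) phiNew(w)(t) /= s
        }
        for (d <- 0 until dNum) {
          val s = thetaNew.map(_(d)).sum
          if (s > 0) for (t <- 0 until tNum) thetaNew(t)(d) /= s
        }
        (phiNew, thetaNew)
      }
    }

In the distributed version this is where broadcasting enters: every
partition needs the full W x T matrix \phi for the E-step (hence the
once-per-iteration broadcast from point 6), while the columns of \theta can
stay partitioned with their documents.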




Re: PLSA

Posted by Debasish Das <de...@gmail.com>.
Thanks for the pointer...

Looks like you are using an EM algorithm for the factorization, which looks
similar to multiplicative update rules.

Do you think that with MLlib's implicit-feedback ALS you could scale the
problem further?

We can handle L1, L2, equality, and positivity constraints in ALS now... As
long as you can derive the gradient and Hessian from the KL divergence loss,
you can use them in place of the Gram matrix that ALS uses right now.

If you look at the topic modeling work in Solr (Carrot is the package), they
use ALS to generate the topics... that algorithm looks like a simplified
version of what you are attempting here...

Maybe the EM algorithm for topic modeling is more efficient than ALS, but
from looking at it I don't see how... I see a lot of broadcasts... while in
implicit feedback you need only one broadcast of the Gram matrix...
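
For concreteness, here is a minimal sketch of what I mean by trying MLlib's
implicit-feedback ALS on bag-of-words data. Treating (docId, wordId, count)
triples as implicit "ratings" is my assumption, and the input path and
parameters are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object PlsaViaAlsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("plsa-als-sketch"))
        // Hypothetical input: one "docId <tab> wordId <tab> count" per line.
        val counts = sc.textFile("doc_word_counts.tsv").map { line =>
          val Array(doc, word, n) = line.split("\t")
          Rating(doc.toInt, word.toInt, n.toDouble)
        }
        // rank plays the role of the number of topics T; alpha scales the
        // confidence we place in observed counts (implicit formulation).
        val model = ALS.trainImplicit(counts, 20, 10, 0.01, 40.0)
        val docFactors  = model.userFeatures    // rough analogue of theta
        val wordFactors = model.productFeatures // rough analogue of phi
        println(s"factors: ${docFactors.count()} docs, ${wordFactors.count()} words")
        sc.stop()
      }
    }

(The loss here is weighted squared error rather than KL, and the factors are
not constrained to be probability distributions, so at best this is the kind
of simplified version I mentioned above.)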


Re: PLSA

Posted by Denis Turdakov <tu...@ispras.ru>.
Hi, Deb.

I don't quite understand the question. PLSA is an instance of the matrix
factorization problem.

If you are asking about the inference algorithm, we use the EM algorithm. A
description of this approach can be found, for example, here:
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf


Best, Denis.




Re: PLSA

Posted by Debasish Das <de...@gmail.com>.
Hi Denis,

Are you using matrix factorization to generate the latent factors?

Thanks.
Deb


