Posted to user@spark.apache.org by Jayati <ti...@gmail.com> on 2014/06/17 09:22:58 UTC

Contribution to Spark MLLib

Hello,

I wish to contribute some algorithms to Spark MLlib, but at the same time
I want to make sure that I don't attempt something redundant.

Would it be okay with you to let me know which algorithms aren't on your
roadmap for the near future?

Also, can I use Java to write machine learning algorithms for Spark MLlib
instead of Scala?

Regards,
Jayati



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Contribution-to-Spark-MLLib-tp7716.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Contribution to Spark MLLib

Posted by Debasish Das <de...@gmail.com>.
Denis,

If it is PLSA with a least squares loss, then the QuadraticMinimizer that we
open sourced should be able to solve it for a modest number of topics (up to
1000, I believe). If we integrate a CG solver for the equality constraints
(Nocedal's KNITRO paper is the reference), the topic count can be pushed well
beyond the normal ALS ranks of 50-400.

Please look at the following JIRA and see if Formulation 4 fits your
use case; we will be using it internally for topic modeling:

https://issues.apache.org/jira/browse/SPARK-2426

If you need convex losses like KL divergence, which I believe is what most
PLSA formulations use, that is not supported right now. Internally we decided
to start with the least squares loss first and move to KL divergence if we
need further cluster purity.
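
To make the least squares piece concrete, here is a small, self-contained
Scala sketch of one such subproblem: minimize ||Ax - b||^2 with x constrained
to the probability simplex (nonnegative entries summing to one), solved by
projected gradient. This is purely illustrative, it does not use the actual
QuadraticMinimizer API, and all names in it are made up for the example.

object SimplexLeastSquares {

  // Euclidean projection of v onto the probability simplex
  // (sort-and-threshold algorithm, O(n log n)).
  def projectSimplex(v: Array[Double]): Array[Double] = {
    val u = v.sortBy(x => -x)
    val css = u.scanLeft(0.0)(_ + _).tail      // cumulative sums of the sorted values
    var rho = 0
    for (i <- u.indices)
      if (u(i) + (1.0 - css(i)) / (i + 1) > 0.0) rho = i
    val theta = (css(rho) - 1.0) / (rho + 1)
    v.map(x => math.max(x - theta, 0.0))
  }

  // Minimize ||A x - b||^2 subject to x >= 0 and sum(x) = 1, by projected gradient.
  def solve(a: Array[Array[Double]], b: Array[Double],
            iters: Int = 500, step: Double = 1e-2): Array[Double] = {
    val n = a(0).length
    var x = Array.fill(n)(1.0 / n)             // start at the center of the simplex
    for (_ <- 0 until iters) {
      // residual r = A x - b
      val r = a.zip(b).map { case (row, bi) =>
        row.zip(x).map { case (aij, xj) => aij * xj }.sum - bi
      }
      // gradient g = 2 A^T r
      val g = Array.tabulate(n)(j => 2.0 * a.indices.map(i => a(i)(j) * r(i)).sum)
      x = projectSimplex(x.zip(g).map { case (xj, gj) => xj - step * gj })
    }
    x
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 0.0, 2.0), Array(0.0, 1.0, 1.0))
    val b = Array(1.0, 0.5)
    println(solve(a, b).mkString(", "))        // entries are nonnegative and sum to 1
  }
}

A line search or a CG-based solver would of course converge faster; the point
is only to show the shape of the constrained subproblem.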

Thanks.
Deb

On Jun 18, 2014 11:39 PM, "Xiangrui Meng" <me...@gmail.com> wrote:

> Denis, I think it is fine to have PLSA in MLlib, but I'm not familiar
> with the modification you mentioned since the paper is new. We may
> need to spend more time to learn the trade-offs. Feel free to create a
> JIRA for PLSA and we can move our discussion there. It would be great
> if you could share your current implementation so that it is easy for
> developers to join the discussion.
>
> Jayati, it is certainly NOT mandatory. But if you want to contribute
> something new, please create a JIRA first.
>
> Best,
> Xiangrui
>

Re: Contribution to Spark MLLib

Posted by Xiangrui Meng <me...@gmail.com>.
Denis, I think it is fine to have PLSA in MLlib, but I'm not familiar
with the modification you mentioned since the paper is new. We may
need to spend more time to learn the trade-offs. Feel free to create a
JIRA for PLSA and we can move our discussion there. It would be great
if you could share your current implementation so that it is easy for
developers to join the discussion.

Jayati, it is certainly NOT mandatory. But if you want to contribute
something new, please create a JIRA first.

Best,
Xiangrui

Re: Contribution to Spark MLLib

Posted by Jayati <ti...@gmail.com>.
Hello Xiangrui,

I am looking at the Spark JIRA issues, but just wanted to know if it is
mandatory for me to work on existing JIRAs before I can contribute to MLlib.

Regards,
Jayati





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Contribution-to-Spark-MLLib-tp7716p7895.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Contribution to Spark MLLib

Posted by Denis Turdakov <tu...@ispras.ru>.
Hello everybody,

Xiangrui, thanks for the link to the roadmap. I saw that LDA is planned for
MLlib 1.1. What do you think about PLSA?

I understand that LDA is more popular now, but recent research shows that
modifications of PLSA sometimes perform better [1]. Furthermore, the most
recent paper by the same authors shows that there is a clear way to extend
PLSA to LDA and beyond [2]. We could implement PLSA with these modifications
in MLlib. Would that be of interest?

Actually, we already have an implementation of Robust PLSA on Spark, so the
task is to integrate it into MLlib.
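
For readers who haven't seen PLSA before, here is a minimal Scala sketch of
the standard EM updates over a dense document-word count matrix. It is purely
illustrative (single machine, no regularization) and is not how our Robust
PLSA implementation is structured.

object PlsaEm {
  // n(d)(w): word counts per document; returns (P(z|d), P(w|z)) after `iters` EM rounds.
  def fit(n: Array[Array[Double]], topics: Int, iters: Int, seed: Long = 42L)
    : (Array[Array[Double]], Array[Array[Double]]) = {
    val rnd = new scala.util.Random(seed)
    val docs = n.length
    val words = n(0).length

    // Random, row-normalized initial distributions.
    def randomRows(rows: Int, cols: Int): Array[Array[Double]] =
      Array.fill(rows) {
        val v = Array.fill(cols)(rnd.nextDouble() + 1e-3)
        val s = v.sum
        v.map(_ / s)
      }

    var pzd = randomRows(docs, topics)     // P(z|d)
    var pwz = randomRows(topics, words)    // P(w|z)

    for (_ <- 0 until iters) {
      val accPzd = Array.fill(docs, topics)(0.0)
      val accPwz = Array.fill(topics, words)(0.0)
      for (d <- 0 until docs; w <- 0 until words if n(d)(w) > 0) {
        // E-step: unnormalized posterior P(z|d,w) proportional to P(z|d) * P(w|z)
        val post = Array.tabulate(topics)(z => pzd(d)(z) * pwz(z)(w))
        val s = post.sum
        // M-step accumulation: expected counts n(d,w) * P(z|d,w)
        for (z <- 0 until topics) {
          val c = n(d)(w) * post(z) / s
          accPzd(d)(z) += c
          accPwz(z)(w) += c
        }
      }
      // Normalize the accumulated counts back into distributions.
      pzd = accPzd.map { row => val s = row.sum; row.map(_ / s) }
      pwz = accPwz.map { row => val s = row.sum; row.map(_ / s) }
    }
    (pzd, pwz)
  }
}

The modifications in [1] and [2] change these update equations (e.g. via
additive regularization of the M-step); the sketch above is only the
unregularized baseline.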

1. A. Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA.
In Proceedings of ECIR'13.
2. K. Vorontsov, A. Potapenko. 2014. Tutorial on Probabilistic Topic Modeling:
Additive Regularization for Stochastic Matrix Factorization.
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

Best regards,
Denis.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Contribution-to-Spark-MLLib-tp7716p7844.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Contribution to Spark MLLib

Posted by Jayati <ti...@gmail.com>.
Hello Xiangrui,

Thanks for sharing the roadmap. It really helped.

Regards,
Jayati





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Contribution-to-Spark-MLLib-tp7716p7826.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Contribution to Spark MLLib

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Jayati,

Thanks for asking! MLlib algorithms are all implemented in Scala; it is
easier for us to maintain the code if the implementations are in one place.
For the roadmap, please visit
http://www.slideshare.net/xrmeng/m-llib-hadoopsummit to see the features
planned for v1.1. Before contributing new algorithms, it would be great if
you could start by working on an existing JIRA.

Best,
Xiangrui

On Tue, Jun 17, 2014 at 12:22 AM, Jayati <ti...@gmail.com> wrote:
> Hello,
>
> I wish to contribute some algorithms to Spark MLlib, but at the same time
> I want to make sure that I don't attempt something redundant.
>
> Would it be okay with you to let me know which algorithms aren't on your
> roadmap for the near future?
>
> Also, can I use Java to write machine learning algorithms for Spark MLlib
> instead of Scala?
>
> Regards,
> Jayati
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Contribution-to-Spark-MLLib-tp7716.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.