Posted to dev@spark.apache.org by danqing0703 <da...@berkeley.edu> on 2014/12/30 07:12:47 UTC

Problems concerning implementing machine learning algorithms from scratch based on Spark

Hi all,

I am trying to use some machine learning algorithms that are not included
in MLlib, such as mixture models and LDA (Latent Dirichlet Allocation), and
I am using PySpark and Spark SQL.

My problem is: I have some scripts that implement these algorithms, but I
am not sure which parts I should change to make them scale to big data.

   - Some very simple calculations may take a long time if the data is too
   big, but constructing an RDD or a SQLContext table also takes a lot of
   time. I am really not sure whether I should use map() and reduce() every
   time I need to do a calculation (see the sketch after this list).
   - Also, there are some matrix/array-level calculations that cannot be
   implemented easily with map() and reduce() alone, so functions from the
   NumPy package have to be used. When the data is too big and we simply
   call the NumPy functions, will it take too much time?
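
For concreteness, here is a rough sketch of the per-partition NumPy pattern I
have in mind (the data, mean, and covariance below are made-up placeholders,
not code from the attached script): rather than calling map() on every single
record, mapPartitions() lets NumPy vectorize the math over a whole partition
at once.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="mixture-sketch")

    # Toy data: an RDD of dense feature vectors.
    data = sc.parallelize([np.random.randn(3) for _ in range(1000)])

    mu = np.zeros(3)        # one component's mean (placeholder)
    inv_sigma = np.eye(3)   # its inverse covariance (placeholder)

    def log_density_partition(rows):
        # Stack the whole partition into one (n, d) matrix so the
        # quadratic form is computed in a single vectorized NumPy call
        # instead of one Python-level call per record.
        X = np.array(list(rows))
        if X.size == 0:
            return iter([])
        diff = X - mu
        quad = np.einsum('ij,jk,ik->i', diff, inv_sigma, diff)
        return iter(-0.5 * quad)

    log_densities = data.mapPartitions(log_density_partition)
    print(log_densities.take(5))

My understanding is that the per-record Python call overhead, rather than
NumPy itself, is usually the bottleneck, so batching per partition like this
should help, but I would be glad to have that confirmed.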

I have found some scripts that are not from MLlib and were created by other
developers (credits to Meethu Mathew from Flytxt, thanks for giving me
insights! :))

Many thanks, and I look forward to your feedback!

Best,
Danqing


GMMSpark.py (7K) <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py>




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Problems-concerning-implementing-machine-learning-algorithm-from-scratch-based-on-Spark-tp9964.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Problems concerning implementing machine learning algorithms from scratch based on Spark

Posted by MEETHU MATHEW <me...@yahoo.co.in>.
Hi,
The GMMSpark.py you mentioned is the old one. The new code has been added to spark-packages and is available at http://spark-packages.org/package/11 . Have a look at the new code.

We have used NumPy functions in our code and did not notice any slowdown because of them.
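
As a rough illustration (a simplified sketch with made-up sizes, not the
actual spark-packages code), the pattern is to accumulate per-cluster
sufficient statistics as NumPy arrays with aggregate(), so the heavy
arithmetic stays inside NumPy:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="gmm-stats-sketch")

    k, d = 2, 3  # number of components and feature dimension (illustrative)
    points = sc.parallelize([np.random.randn(d) for _ in range(1000)])

    def seq_op(acc, x):
        counts, sums = acc
        # resp would come from the E-step; a uniform placeholder here.
        resp = np.full(k, 1.0 / k)
        return counts + resp, sums + np.outer(resp, x)

    def comb_op(a, b):
        return a[0] + b[0], a[1] + b[1]

    zero = (np.zeros(k), np.zeros((k, d)))
    counts, sums = points.aggregate(zero, seq_op, comb_op)
    means = sums / counts[:, np.newaxis]  # M-step mean update

The per-record work here is just a couple of small NumPy operations, which
matches our experience that NumPy itself is not the bottleneck.
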
Thanks & Regards,
Meethu M
