Posted to user@spark.apache.org by "Venkat, Ankam" <An...@centurylink.com> on 2014/12/14 15:26:09 UTC

MLlib vs Madlib

Can somebody throw light on MLlib vs Madlib?

Which is better for machine learning, and are there any specific use cases where MLlib or Madlib will shine?

Regards,
Venkat Ankam
This communication is the property of CenturyLink and may contain confidential or privileged information. Unauthorized use of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy all copies of the communication and any attachments.

Re: MLlib vs Madlib

Posted by Brian Dolan <bu...@yahoo.com.INVALID>.

I don't have any solid performance numbers, no.  Let's start with some questions:

* Do you have to do any feature extraction before you start the routine? E.g. NLP, NER or tokenization? Have you already vectorized?
* Which routine(s) do you wish to use?  Things like k-means do very well in a relational setting, neural networks not as much.
* Where does the data live now?  How often will you have to re-load the data and re-run the pipeline?
* The ML portion is probably the most expensive portion of the pipeline, so it may justify moving it in/out of HDFS or Greenplum for just the ML.

For processing speed, my guess is Greenplum will be fastest, then Spark + HDFS, then Greenplum + HAWQ.

I've done quite a bit of large-scale text analysis, and the process is typically:

1. Source the data. Either in Solr or HDFS or a drive somewhere
2. Annotation / Feature Extraction (just get the bits you need from the data)
3. Create vectors from the data.  TF-IDF is the most popular format.
4. Run the routine
5. Shout "Damn" when you realize you did it wrong.
6. Do 1-5 again. And again.
7. Create a report of some sort.
8. Visualize.
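
As a rough illustration of step 3, here is a minimal pure-Python sketch of TF-IDF vectorization. It is not taken from MLlib or MADlib; the whitespace tokenizer, the function names, and the sample documents are all assumptions made for the example, and the weighting used is the plain (term frequency) x log(N / document frequency) variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real
    # annotation / feature extraction (step 2).
    return text.lower().split()


def tf_idf(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample documents.
docs = [
    "spark runs on hdfs",
    "madlib runs in postgres",
    "spark and madlib",
]
vectors = tf_idf(docs)
```

Terms that occur in fewer documents get a larger weight, so in the first document "hdfs" outweighs "spark". Running something this small end to end is one cheap way to get through steps 5 and 6 before committing to a platform.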

When asking about performance, most people focus on (4).  When focused on production, you need to consider the total cost of the pipeline.  So my basic recommendation is to do the whole thing on a small scale first.  If you end up with very "relational" questions, put everything in Greenplum.  If it all comes down to a query on a single table, use Spark RDDs and maybe Spark SQL.

Just as an example, I've seen standard Postgres run extremely fast on Weighted Dictionaries.  This demands just two tables: the weighted dictionary and a table with your documents.   Though it's possible (and I've been foolish enough to do it), you don't want to spend the time embedding Stanford NLP into Postgres; the performance is awful.
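
To make the two-table idea concrete, here is a small sketch of weighted-dictionary scoring as a single join plus aggregate. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere. The table names, column names, weights, and sample documents are all hypothetical, and storing the documents as one row per (doc_id, term) is an assumption about the layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weighted_dictionary (term TEXT PRIMARY KEY, weight REAL);
    CREATE TABLE documents (doc_id INTEGER, term TEXT);
""")

# Hypothetical dictionary: each term carries a weight.
conn.executemany(
    "INSERT INTO weighted_dictionary VALUES (?, ?)",
    [("fast", 1.0), ("great", 1.5), ("awful", -2.0)],
)

# Hypothetical documents, tokenized outside the database and stored
# one row per (doc_id, term).
docs = {
    1: "postgres was fast and great",
    2: "the performance is awful",
}
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(doc_id, term) for doc_id, text in docs.items() for term in text.split()],
)

# Score each document as the sum of the weights of its dictionary terms.
scores = dict(conn.execute("""
    SELECT d.doc_id, SUM(w.weight)
    FROM documents AS d
    JOIN weighted_dictionary AS w ON w.term = d.term
    GROUP BY d.doc_id
"""))
```

The whole scoring step is one join and one GROUP BY, which is the kind of "relational" question a database engine handles well; the tokenization stays outside the database.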

Let me know how it goes!
b
https://twitter.com/buddha_314

~~~~~~
May All Your Sequences Converge

On Dec 14, 2014, at 4:07 PM, "Venkat, Ankam" <An...@centurylink.com> wrote:

> Thanks for the info Brian.
>  
> I am trying to compare performance difference between “Pivotal HAWQ/Greenplum with MADlib” vs “HDFS with MLlib”. 
>  
> Do you think Spark MLlib will perform better because of in-memory, caching and iterative processing capabilities?   
>  
> I need to perform large-scale text analytics, and I can store the data on HDFS or on Pivotal Greenplum/HAWQ.
>  
> Regards,
> Venkat Ankam
>  
> From: Brian Dolan [mailto:buddha_314@yahoo.com] 
> Sent: Sunday, December 14, 2014 10:02 AM
> To: Venkat, Ankam
> Cc: 'user@spark.apache.org'
> Subject: Re: MLlib vs Madlib
>  
> MADLib (http://madlib.net/) was designed to bring large-scale ML techniques to a relational database, primarily postgresql.  MLlib assumes the data exists in some Spark-compatible data format.
>  
> I would suggest you pick the library that matches your data platform first.
>  
> DISCLAIMER: I am the original author of MADLib, though EMC/Pivotal assumed ownership rather quickly.
>  
>  
> ~~~~~~
> May All Your Sequences Converge
>  
>  
>  
> On Dec 14, 2014, at 6:26 AM, "Venkat, Ankam" <An...@centurylink.com> wrote:
> 
> 
> Can somebody throw light on MLlib vs Madlib? 
>  
> Which is better for machine learning? and are there any specific use case scenarios MLlib or Madlib will shine in?
>  
> Regards,
> Venkat Ankam

RE: MLlib vs Madlib

Posted by "Venkat, Ankam" <An...@centurylink.com>.

Thanks for the info Brian.

I am trying to compare the performance of "Pivotal HAWQ/Greenplum with MADlib" vs "HDFS with MLlib".

Do you think Spark MLlib will perform better because of its in-memory caching and iterative processing capabilities?

I need to perform large-scale text analytics, and I can store the data on HDFS or on Pivotal Greenplum/HAWQ.

Regards,
Venkat Ankam

From: Brian Dolan [mailto:buddha_314@yahoo.com]
Sent: Sunday, December 14, 2014 10:02 AM
To: Venkat, Ankam
Cc: 'user@spark.apache.org'
Subject: Re: MLlib vs Madlib

MADLib (http://madlib.net/) was designed to bring large-scale ML techniques to a relational database, primarily postgresql.  MLlib assumes the data exists in some Spark-compatible data format.

I would suggest you pick the library that matches your data platform first.

DISCLAIMER: I am the original author of MADLib, though EMC/Pivotal assumed ownership rather quickly.


~~~~~~
May All Your Sequences Converge



On Dec 14, 2014, at 6:26 AM, "Venkat, Ankam" <An...@centurylink.com> wrote:


Can somebody throw light on MLlib vs Madlib?

Which is better for machine learning? and are there any specific use case scenarios MLlib or Madlib will shine in?

Regards,
Venkat Ankam

Re: MLlib vs Madlib

Posted by Brian Dolan <bu...@yahoo.com.INVALID>.

MADLib (http://madlib.net/) was designed to bring large-scale ML techniques to a relational database, primarily PostgreSQL.  MLlib assumes the data exists in some Spark-compatible data format.

I would suggest you pick the library that matches your data platform first.

DISCLAIMER: I am the original author of MADLib, though EMC/Pivotal assumed ownership rather quickly.

~~~~~~
May All Your Sequences Converge

On Dec 14, 2014, at 6:26 AM, "Venkat, Ankam" <An...@centurylink.com> wrote:

> Can somebody throw light on MLlib vs Madlib? 
>  
> Which is better for machine learning? and are there any specific use case scenarios MLlib or Madlib will shine in?
>  
> Regards,
> Venkat Ankam