Posted to user@spark.apache.org by Jean Georges Perrin <jg...@jgp.net> on 2016/07/22 03:41:47 UTC

MLlib, Java, and DataFrame

Hi,

I am looking for some really super basic examples of MLlib (like a linear regression over a list of values) in Java. I have found a few, but I only saw them using JavaRDD... and not DataFrame.

I was kind of hoping to take my current DataFrame and send it to MLlib. Am I too optimistic? Do you know/have any example like that?

Thanks!

jg


Jean Georges Perrin
jgp@jgp.net / @jgperrin





Re: MLlib, Java, and DataFrame

Posted by VG <vl...@gmail.com>.
Interesting. Thanks for this information.

On Fri, Jul 22, 2016 at 11:26 AM, Bryan Cutler <cu...@gmail.com> wrote:

> The spark.ml package has a DataFrame-based API, while spark.mllib is RDD-based
> and goes into maintenance mode as of Spark 2.0.

Re: MLlib, Java, and DataFrame

Posted by Bryan Cutler <cu...@gmail.com>.
The spark.ml package has a DataFrame-based API, while spark.mllib is RDD-based
and goes into maintenance mode as of Spark 2.0.

On Thu, Jul 21, 2016 at 10:41 PM, VG <vl...@gmail.com> wrote:

> Why do we have these 2 packages, ml and mllib?
> What is the difference between them?

Re: MLlib, Java, and DataFrame

Posted by VG <vl...@gmail.com>.
Why do we have these 2 packages, ml and mllib?
What is the difference between them?



On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler <cu...@gmail.com> wrote:

> Hi JG,
>
> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
> DataFrames.  Take a look at this example
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>
> This example uses a Dataset<Row>, which is type equivalent to a DataFrame.

Re: MLlib, Java, and DataFrame

Posted by Jean Georges Perrin <jg...@jgp.net>.
Thanks Bryan - I keep forgetting about the examples... This is almost it :) I can work with that :)


> On Jul 22, 2016, at 1:39 AM, Bryan Cutler <cu...@gmail.com> wrote:
> 
> Hi JG,
> 
> If you didn't know this, Spark MLlib has 2 APIs, one of which uses DataFrames.  Take a look at this example https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
> 
> This example uses a Dataset<Row>, which is type equivalent to a DataFrame.


Re: MLlib, Java, and DataFrame

Posted by Bryan Cutler <cu...@gmail.com>.
Hi JG,

If you didn't know this, Spark MLlib has 2 APIs, one of which uses
DataFrames.  Take a look at this example
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java

This example uses a Dataset<Row>, which is type equivalent to a DataFrame.


On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin <jg...@jgp.net> wrote:

> Hi,
>
> I am looking for some really super basic examples of MLlib (like a linear
> regression over a list of values) in Java. I have found a few, but I only
> saw them using JavaRDD... and not DataFrame.
>
> I was kind of hoping to take my current DataFrame and send it to MLlib.
> Am I too optimistic? Do you know/have any example like that?
>
> Thanks!
>
> jg
>
>
> Jean Georges Perrin
> jgp@jgp.net / @jgperrin

Re: MLlib, Java, and DataFrame

Posted by Marco Mistroni <mm...@gmail.com>.
Hi Inam,
I sorted it.
I am replying to all, in case anyone else follows the blog and runs into the
same issue.

- First off, the environment. I have tested the sample using purely
spark-1.6.1, no Hive, no Hadoop. I launched pyspark as follows:
pyspark --packages com.databricks:spark-csv_2.10:1.4.0

- Secondly, please note that when I do printSchema (at step 1), the column
'Churn' is listed as 'boolean', not as a string like in the blog. This might
be due to the spark-csv version I am using (1.4.0).

>>> CV_data.printSchema()
root
 |-- State: string (nullable = true)
 |-- Account length: integer (nullable = true)
 |-- Area code: integer (nullable = true)
 |-- International plan: string (nullable = true)
 |-- Voice mail plan: string (nullable = true)
 |-- Number vmail messages: integer (nullable = true)
 |-- Total day minutes: double (nullable = true)
 |-- Total day calls: integer (nullable = true)
 |-- Total day charge: double (nullable = true)
 |-- Total eve minutes: double (nullable = true)
 |-- Total eve calls: integer (nullable = true)
 |-- Total eve charge: double (nullable = true)
 |-- Total night minutes: double (nullable = true)
 |-- Total night calls: integer (nullable = true)
 |-- Total night charge: double (nullable = true)
 |-- Total intl minutes: double (nullable = true)
 |-- Total intl calls: integer (nullable = true)
 |-- Total intl charge: double (nullable = true)
 |-- Customer service calls: integer (nullable = true)
 |-- Churn: boolean (nullable = true)



- Thirdly, at step 6, please replace the binary_map definition with the
following. As I said, Churn is not a string column but a boolean, and
therefore the toNum function will fail.

binary_map = {'Yes':1.0, 'No':0.0, True:1.0, False:0.0}

I managed to reach step 7 without any issues (I don't have matplotlib, so I
skipped step 5, which I guess is irrelevant as it just displays the data
rather than doing any logic).
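
A minimal sketch in plain Python (no Spark needed) of why the extended mapping
helps. It assumes the blog's toNum simply looks each value up in binary_map,
which is a guess; to_num and string_only_map are hypothetical names used only
for illustration.

```python
# Extended mapping: accepts both the 'Yes'/'No' strings and boolean values.
binary_map = {'Yes': 1.0, 'No': 0.0, True: 1.0, False: 0.0}

# Assumed shape of the original mapping, with string keys only.
string_only_map = {'Yes': 1.0, 'No': 0.0}


def to_num(value):
    """Hypothetical stand-in for toNum: a plain dictionary lookup."""
    return binary_map[value]


# String columns such as 'International plan' still map as before.
print(to_num('Yes'))            # 1.0
# A boolean 'Churn' value now maps as well.
print(to_num(True))             # 1.0
# With string keys only, a boolean has no entry, so the lookup would fail.
print(True in string_only_map)  # False
```

With string keys only, looking up True raises a KeyError, which would be
consistent with the failure described above if toNum works this way.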

Please let me know if this fixes your problems.

hth

 marco

On Fri, Jul 22, 2016 at 6:34 PM, Inam Ur Rehman <in...@gmail.com>
wrote:

> Hello guys, I know it's irrelevant to this topic, but I've been looking
> desperately for a solution. I am facing an exception:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html
>
> Please help me. I couldn't find any solution.

Re: MLlib, Java, and DataFrame

Posted by Marco Mistroni <mm...@gmail.com>.
How did you build your Spark distribution?
Could you detail the steps?
Hive, AFAIK, depends on Hadoop. If you don't configure Spark correctly,
it will assume Hadoop is your filesystem.
I'm not using Hadoop or Hive. You might want to get a Cloudera
distribution, which includes Spark, Hadoop, and Hive by default.
HTH

On 22 Jul 2016 6:34 pm, "Inam Ur Rehman" <in...@gmail.com> wrote:

> Hello guys, I know it's irrelevant to this topic, but I've been looking
> desperately for a solution. I am facing an exception:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html
>
> Please help me. I couldn't find any solution.

Re: MLlib, Java, and DataFrame

Posted by Inam Ur Rehman <in...@gmail.com>.
Hello guys, I know it's irrelevant to this topic, but I've been looking
desperately for a solution. I am facing an exception:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html

Please help me. I couldn't find any solution.

On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin <jg...@jgp.net> wrote:

> Thanks Marco - I like the idea of sticking with DataFrames ;)

Re: MLlib, Java, and DataFrame

Posted by Jean Georges Perrin <jg...@jgp.net>.
Thanks Marco - I like the idea of sticking with DataFrames ;)


> On Jul 22, 2016, at 7:07 AM, Marco Mistroni <mm...@gmail.com> wrote:
> 
> Hello Jean,
> you can take your current DataFrame and send it to mllib (I was doing that because I didn't know the ml package), but the process is a little bit cumbersome:
> 
> 1. go from the DataFrame to an RDD of LabeledPoint
> 2. run your ML model
> 
> I'd suggest you stick to DataFrame + the ml package :)
> 
> hth


Re: MLlib, Java, and DataFrame

Posted by Marco Mistroni <mm...@gmail.com>.
Hello Jean,
you can take your current DataFrame and send it to mllib (I was doing that
because I didn't know the ml package), but the process is a little bit
cumbersome:

1. go from the DataFrame to an RDD of LabeledPoint
2. run your ML model

I'd suggest you stick to DataFrame + the ml package :)

hth



On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin <jg...@jgp.net> wrote:

> Hi,
>
> I am looking for some really super basic examples of MLlib (like a linear
> regression over a list of values) in Java. I have found a few, but I only
> saw them using JavaRDD... and not DataFrame.
>
> I was kind of hoping to take my current DataFrame and send it to MLlib.
> Am I too optimistic? Do you know/have any example like that?
>
> Thanks!
>
> jg
>
>
> Jean Georges Perrin
> jgp@jgp.net / @jgperrin

Re: MLlib, Java, and DataFrame

Posted by Jean Georges Perrin <jg...@jgp.net>.
Hi Jules,

Thanks, but not really: I know what DataFrames are and I actually use them, especially as RDDs will slowly fade. A lot of the examples I see focus on cleaning/preparing the data, which is an important part, but not really on what comes "after"... Sorry if I am not completely clear.

> On Jul 22, 2016, at 1:08 AM, Jules Damji <ju...@databricks.com> wrote:
> 
> Is this what you had in mind?
> 
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
> 
> Cheers 
> Jules 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)