Posted to user@predictionio.apache.org by Shane Johnson <sh...@gmail.com> on 2018/01/04 19:55:24 UTC

Using Dataframe API vs. RDD API?

Hello group, Happy New Year! Does anyone have a working example or template
using the DataFrame API vs. the RDD-based APIs? We are wanting to migrate
to the new DataFrame APIs to take advantage of the *Feature
Importance* function for our Regression Random Forest Models.

We are wanting to move from

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

to

import org.apache.spark.ml.regression.{RandomForestRegressionModel,
RandomForestRegressor}


Is this something that should be fairly straightforward by adjusting
parameters and calling new classes within DASE, or is it much more involved
development?
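
For concreteness, a minimal sketch of what the ml-side training could look like (the `trainForest` helper, column names, and tree count are illustrative, not taken from any PIO template):

```scala
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.DataFrame

// Train an ml (DataFrame-based) random forest. Unlike mllib's
// RandomForest.trainRegressor, which took an RDD[LabeledPoint], the ml
// estimator reads label/feature columns from a DataFrame.
def trainForest(training: DataFrame): RandomForestRegressionModel = {
  val rf = new RandomForestRegressor()
    .setLabelCol("label")
    .setFeaturesCol("features") // must be a spark.ml Vector column
    .setNumTrees(100)
  val model = rf.fit(training)
  // The payoff of migrating: per-feature importances.
  println(model.featureImportances)
  model
}
```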

Thank You!

*Shane Johnson | 801.360.3350*
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
<https://www.facebook.com/shane.johnson.71653>

Re: Using Dataframe API vs. RDD API?

Posted by Donald Szeto <do...@apache.org>.
Hi Shane,

You are correct that Spark ML requires a DataFrame/Dataset, because the
generated model is in fact a Transformer, which requires those input types.
Until we finish adding Spark ML support, the workaround would be Daniel’s
suggestion.
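
As a hedged illustration of that point: because the fitted model is a Transformer, even a single prediction goes through a one-row DataFrame (the `predictOne` helper below is illustrative, not part of any PIO or Spark API):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressionModel
import org.apache.spark.sql.SparkSession

// Wrap a single feature vector in a one-row DataFrame, run the model's
// transform(), and read back the "prediction" column.
def predictOne(spark: SparkSession,
               model: RandomForestRegressionModel,
               features: Array[Double]): Double = {
  import spark.implicits._
  val df = Seq(Tuple1(Vectors.dense(features))).toDF("features")
  model.transform(df).select("prediction").head.getDouble(0)
}
```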

Regards,
Donald

On Tue, Jan 30, 2018 at 10:53 AM Shane Johnson <sh...@gmail.com>
wrote:

> I remember this now. Thanks Daniel. Does this confirm that I do indeed
> need to use a spark context when using the new dataframe API (ml vs mllib)?
> I wanted to make sure there wasn't a way to use the new ml library to
> predict without using a dataframe.
>
> *Shane Johnson | 801.360.3350*
> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
> <https://www.facebook.com/shane.johnson.71653>
>
> 2018-01-30 7:09 GMT-10:00 Daniel O' Shaughnessy <
> danieljamesdavid@gmail.com>:
>
>> Hi Shane,
>>
>> You need to use PAlgorithm instead of P2Algorithm and save/load the Spark
>> context accordingly. This way you can use the Spark context in the predict
>> function.
>>
>> There are examples of using PAlgorithm on the PredictionIO site. It’s
>> slightly more complicated but not too bad!
>>
>>
>> On Tue, 30 Jan 2018 at 17:06, Shane Johnson <sh...@gmail.com>
>> wrote:
>>
>>> Thanks team! We are close to having our models working with the
>>> Dataframe API. One additional roadblock we are hitting is the fundamental
>>> difference in the RDD based API vs the Dataframe API. It seems that the old
>>> mllib API would accept a simple vector for predictions, whereas in the new
>>> ml API a dataframe is required. This presents a challenge, as the predict
>>> function in PredictionIO does not have a spark context.
>>>
>>> Any ideas how to overcome this? Am I thinking through this correctly or
>>> are there other ways to get predictions with the new ml Dataframe API
>>> without having a dataframe as input?
>>>
>>> Best,
>>>
>>> Shane
>>>
>>> *Shane Johnson | 801.360.3350*
>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>> <https://www.facebook.com/shane.johnson.71653>
>>>
>>> 2018-01-08 20:37 GMT-10:00 Donald Szeto <do...@apache.org>:
>>>
>>>> We do have work-in-progress for DataFrame API tracked at
>>>> https://issues.apache.org/jira/browse/PIO-71.
>>>>
>>>> Chan, it would be nice if you could create a branch on your personal
>>>> fork if you want to hand it off to someone else. Thanks!
>>>>
>>>> On Fri, Jan 5, 2018 at 2:02 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>> wrote:
>>>>
>>>>> Yes and I do not recommend that because the EventServer schema is not
>>>>> a developer contract. It may change at any time. Use the conversion method
>>>>> and go through the PIO API to get the RDD then convert to DF for now.
>>>>>
>>>>> I’m not sure what PIO uses to get an RDD from Postgres, but if it does
>>>>> not use something like the lib you mention, a PR would be nice. Also, if you
>>>>> have an interest in adding the DF APIs to the EventServer, contributions are
>>>>> encouraged. Committers will give some guidance, I’m sure, once they know more
>>>>> than me on the subject.
>>>>>
>>>>> If you want to donate some DF code, create a Jira and we’ll easily
>>>>> find a mentor to make suggestions. There are many benefits to this
>>>>> including not having to support a fork of PIO through subsequent versions.
>>>>> Also others are interested in this too.
>>>>>
>>>>>
>>>>>
>>>>> On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <
>>>>> danieljamesdavid@gmail.com> wrote:
>>>>>
>>>>> ....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to
>>>>> read in the RDD from a postgres DB initially.
>>>>>
>>>>> This way you don't need to use an EventServer!
>>>>>
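
For reference, reading straight from Postgres with JdbcRDD looks roughly like this (a sketch only: the connection string, table, and column layout are placeholders, and the query must contain the two `?` bound parameters that JdbcRDD uses for partitioning):

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Pull (label, feature) pairs out of a Postgres table as an RDD,
// bypassing the EventServer entirely.
def loadTraining(sc: SparkContext): JdbcRDD[(Double, Double)] =
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(
      "jdbc:postgresql://localhost/events", "user", "secret"),
    "SELECT label, feature1 FROM training WHERE ? <= id AND id <= ?",
    1L, 1000000L, 4, // id lower bound, upper bound, number of partitions
    (rs: ResultSet) => (rs.getDouble(1), rs.getDouble(2))
  )
```

Note that running this requires a live database and the Postgres JDBC driver on the classpath.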
>>>>> On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <
>>>>> danieljamesdavid@gmail.com> wrote:
>>>>>
>>>>>> Hi Shane,
>>>>>>
>>>>>> I've successfully used :
>>>>>>
>>>>>> import org.apache.spark.ml.classification.{
>>>>>> RandomForestClassificationModel, RandomForestClassifier }
>>>>>>
>>>>>> with pio. You can access feature importance through the
>>>>>> RandomForestClassifier also.
>>>>>>
>>>>>> Very simple to convert RDDs to DFs as Pat mentioned, something like:
>>>>>>
>>>>>> val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1",
>>>>>> "col2")
>>>>>>
>>>>>>
>>>>>>
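
Daniel's one-liner, fleshed out (Spark 2.x style: `sqlContext` still works, but SparkSession is the usual entry point, and the column names here are placeholders):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Convert an RDD of tuples into a DataFrame so it can feed spark.ml.
// createDataFrame infers the schema from the tuple's element types.
def rddToDf(spark: SparkSession, rows: RDD[(String, Double)]): DataFrame =
  spark.createDataFrame(rows).toDF("col1", "col2")
```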
>>>>>> On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>>>>
>>>>>>> Actually there are libs that will read DFs from HBase
>>>>>>> https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
>>>>>>>
>>>>>>> This is out of band with PIO and should not be used IMO because the
>>>>>>> schema of the EventStore is not guaranteed to remain as-is. The safest way
>>>>>>> is to translate or get DFs integrated to PIO. I think there is an existing
>>>>>>> Jira that requests Spark ML support, which assumes DFs.
>>>>>>>
>>>>>>>
>>>>>>> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Funny you should ask this. Yes, we are working on a DF based
>>>>>>> Universal Recommender but you have to convert the RDD into a DF since PIO
>>>>>>> does not read out data in the form of a DF (yet). This is a fairly simple
>>>>>>> step of maybe one line of code but would be better supported in PIO itself.
>>>>>>> The issue is that the EventStore uses libs that may not read out DFs, but
>>>>>>> RDDs. This is certainly the case with Elasticsearch, which provides an RDD
>>>>>>> lib. I haven’t seen one from them that reads out DFs, though it would make a
>>>>>>> lot of sense for ES especially.
>>>>>>>
>>>>>>> So TLDR; yes, just convert the RDD into a DF for now.
>>>>>>>
>>>>>>> Also please add a feature request as a PIO Jira ticket to look into
>>>>>>> this. I for one would +1
>>>>>>>
>>>>>>>
>>>>>>> On Jan 4, 2018, at 11:55 AM, Shane Johnson <
>>>>>>> shanewaldenjohnson@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello group, Happy new year! Does anyone have a working example or
>>>>>>> template using the DataFrame API vs. the RDD based APIs. We are wanting to
>>>>>>> migrate to using the new DataFrame APIs to take advantage of the *Feature
>>>>>>> Importance* function for our Regression Random Forest Models.
>>>>>>>
>>>>>>> We are wanting to move from
>>>>>>>
>>>>>>> import org.apache.spark.mllib.tree.RandomForest
>>>>>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>>>>>> import org.apache.spark.mllib.util.MLUtils
>>>>>>>
>>>>>>> to
>>>>>>>
>>>>>>> import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
>>>>>>>
>>>>>>>
>>>>>>> Is this something that should be fairly straightforward by adjusting
>>>>>>> parameters and calling new classes within DASE, or is it much more involved
>>>>>>> development?
>>>>>>>
>>>>>>> Thank You!
>>>>>>>
>>>>>>> *Shane Johnson | 801.360.3350*
>>>>>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>>>>>> <https://www.facebook.com/shane.johnson.71653>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>

Re: Using Dataframe API vs. RDD API?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
What template are you using? If it is one of the templates in the Apache repos, you may want to file a bug report. If PIO supports Spark 2.x, the Apache Templates should also IMHO.


From: Daniel O' Shaughnessy <da...@gmail.com>
Reply: user@predictionio.apache.org <us...@predictionio.apache.org>
Date: January 30, 2018 at 9:09:49 AM
To: user@predictionio.apache.org <us...@predictionio.apache.org>
Subject:  Re: Using Dataframe API vs. RDD API?  

Hi Shane,

You need to use PAlgorithm instead of P2Algorithm and save/load the Spark context accordingly. This way you can use the Spark context in the predict function.

There are examples of using PAlgorithm on the PredictionIO site. It’s slightly more complicated but not too bad!




Re: Using Dataframe API vs. RDD API?

Posted by Donald Szeto <do...@apache.org>.
We do have work-in-progress for DataFrame API tracked at
https://issues.apache.org/jira/browse/PIO-71.

Chan, it would be nice if you could create a branch on your personal fork
if you want to hand it off to someone else. Thanks!

On Fri, Jan 5, 2018 at 2:02 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Yes and I do not recommend that because the EventServer schema is not a
> developer contract. It may change at any time. Use the conversion method
> and go through the PIO API to get the RDD then convert to DF for now.
>
> I’m not sure what PIO uses to get an RDD from Postgres but if they do not
> use something like the lib you mention, a PR would be nice. Also if you
> have an interest in adding the DF APIs to the EventServer contributions are
> encouraged. Committers will give some guidance I’m sure—ones who know more
> than me on the subject.
>
> If you want to donate some DF code, create a Jira and we’ll easily find a
> mentor to make suggestions. There are many benefits to this including not
> having to support a fork of PIO through subsequent versions. Also others
> are interested in this too.
>
>
>
> On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <
> danieljamesdavid@gmail.com> wrote:
>
> ....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to
> read in the RDD from a postgres DB initially.
>
> This way you don't need to use an EventServer!
>
> On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <
> danieljamesdavid@gmail.com> wrote:
>
>> Hi Shane,
>>
>> I've successfully used :
>>
>> import org.apache.spark.ml.classification.{
>> RandomForestClassificationModel, RandomForestClassifier }
>>
>> with pio. You can access feature importance through the
>> RandomForestClassifier also.
>>
>> Very simple to convert RDDs to DFs as Pat mentioned, something like:
>>
>> val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
>>
>>
>>
>> On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>>> Actually there are libs that will read DFs from HBase
>>> https://svn.apache.org/repos/asf/hbase/hbase.apache.
>>> org/trunk/_chapters/spark.html
>>>
>>> This is out of band with PIO and should not be used IMO because the
>>> schema of the EventStore is not guaranteed to remain as-is. The safest way
>>> is to translate or get DFs integrated to PIO. I think there is an existing
>>> Jira that requests Spark ML support, which assumes DFs.
>>>
>>>
>>> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>>
>>> Funny you should ask this. Yes, we are working on a DF based Universal
>>> Recommender but you have to convert the RDD into a DF since PIO does not
>>> read out data in the form of a DF (yet). This is a fairly simple step of
>>> maybe one line of code but would be better supported in PIO itself. The
>>> issue is that the EventStore uses libs that may not read out DFs, but RDDs.
>>> This is certainly the case with Elasticsearch, which provides an RDD lib. I
>>> haven’t seen one from them that read out DFs though it would make a lot of
>>> sense for ES especially.
>>>
>>> So TLDR; yes, just convert the RDD into a DF for now.
>>>
>>> Also please add a feature request as a PIO Jira ticket to look into
>>> this. I for one would +1
>>>
>>>
>>> On Jan 4, 2018, at 11:55 AM, Shane Johnson <sh...@gmail.com>
>>> wrote:
>>>
>>> Hello group, Happy new year! Does anyone have a working example or
>>> template using the DataFrame API vs. the RDD based APIs. We are wanting to
>>> migrate to using the new DataFrame APIs to take advantage of the *Feature
>>> Importance* function for our Regression Random Forest Models.
>>>
>>> We are wanting to move from
>>>
>>> import org.apache.spark.mllib.tree.RandomForest
>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>> import org.apache.spark.mllib.util.MLUtils
>>>
>>> to
>>>
>>> import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
>>>
>>>
>>> Is this something that should be fairly straightforward by adjusting
>>> parameters and calling new classes within DASE, or is it much more involved
>>> development?
>>>
>>> Thank You!
>>>
>>> *Shane Johnson | 801.360.3350 <(801)%20360-3350>*
>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>> <https://www.facebook.com/shane.johnson.71653>
>>>
>>>
>>>
>

Re: Using Dataframe API vs. RDD API?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes and I do not recommend that because the EventServer schema is not a developer contract. It may change at any time. Use the conversion method and go through the PIO API to get the RDD then convert to DF for now.

I’m not sure what PIO uses to get an RDD from Postgres but if they do not use something like the lib you mention, a PR would be nice. Also if you have an interest in adding the DF APIs to the EventServer, contributions are encouraged. Committers will give some guidance I’m sure—ones who know more than me on the subject.

If you want to donate some DF code, create a Jira and we’ll easily find a mentor to make suggestions. There are many benefits to this including not having to support a fork of PIO through subsequent versions. Also others are interested in this too.

 

On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <da...@gmail.com> wrote:

....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read in the RDD from a postgres DB initially.

This way you don't need to use an EventServer!

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <danieljamesdavid@gmail.com <ma...@gmail.com>> wrote:
Hi Shane, 

I've successfully used : 

import org.apache.spark.ml.classification.{ RandomForestClassificationModel, RandomForestClassifier }

with pio. You can access feature importance through the RandomForestClassifier also.

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pat@occamsmachete.com <ma...@occamsmachete.com>> wrote:
Actually there are libs that will read DFs from HBase https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html <https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html>

This is out of band with PIO and should not be used IMO because the schema of the EventStore is not guaranteed to remain as-is. The safest way is to translate or get DFs integrated to PIO. I think there is an existing Jira that requests Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pat@occamsmachete.com <ma...@occamsmachete.com>> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal Recommender but you have to convert the RDD into a DF since PIO does not read out data in the form of a DF (yet). This is a fairly simple step of maybe one line of code but would be better supported in PIO itself. The issue is that the EventStore uses libs that may not read out DFs, but RDDs. This is certainly the case with Elasticsearch, which provides an RDD lib. I haven’t seen one from them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohnson@gmail.com <ma...@gmail.com>> wrote:

Hello group, Happy new year! Does anyone have a working example or template using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to using the new DataFrame APIs to take advantage of the Feature Importance function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350 <tel:(801)%20360-3350>
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook <https://www.facebook.com/shane.johnson.71653>



Re: Using Dataframe API vs. RDD API?

Posted by Daniel O' Shaughnessy <da...@gmail.com>.
....Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read
in the RDD from a postgres DB initially.

This way you don't need to use an EventServer!
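For anyone looking for a concrete starting point, a minimal sketch of that kind of read looks like the following. The connection URL, credentials, table, and column names are all placeholders, and note that the query must contain two `?` marks, which Spark fills with the bounds of each partition's id range:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object JdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("jdbc-read").setMaster("local[*]"))

    // Placeholder connection details -- substitute your own Postgres instance.
    val url = "jdbc:postgresql://localhost:5432/mydb"

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "pass"),
      // The two '?' placeholders are required: Spark substitutes each
      // partition's lower and upper id bound into them.
      "SELECT id, label FROM events WHERE ? <= id AND id <= ?",
      1L,       // lowerBound
      1000000L, // upperBound
      4,        // numPartitions
      (rs: ResultSet) => (rs.getLong(1), rs.getDouble(2))
    )

    println(rows.count())
    sc.stop()
  }
}
```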

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <
danieljamesdavid@gmail.com> wrote:

> Hi Shane,
>
> I've successfully used :
>
> import org.apache.spark.ml.classification.{
> RandomForestClassificationModel, RandomForestClassifier }
>
> with pio. You can access feature importance through the
> RandomForestClassifier also.
>
> Very simple to convert RDDs to DFs as Pat mentioned, something like:
>
> val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
>
>
>
> On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> Actually there are libs that will read DFs from HBase
>> https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
>>
>> This is out of band with PIO and should not be used IMO because the
>> schema of the EventStore is not guaranteed to remain as-is. The safest way
>> is to translate or get DFs integrated to PIO. I think there is an existing
>> Jira that requests Spark ML support, which assumes DFs.
>>
>>
>> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>> Funny you should ask this. Yes, we are working on a DF based Universal
>> Recommender but you have to convert the RDD into a DF since PIO does not
>> read out data in the form of a DF (yet). This is a fairly simple step of
>> maybe one line of code but would be better supported in PIO itself. The
>> issue is that the EventStore uses libs that may not read out DFs, but RDDs.
>> This is certainly the case with Elasticsearch, which provides an RDD lib. I
>> haven’t seen one from them that reads out DFs, though it would make a lot of
>> sense for ES especially.
>>
>> So TLDR; yes, just convert the RDD into a DF for now.
>>
>> Also please add a feature request as a PIO Jira ticket to look into this.
>> I for one would +1
>>
>>
>> On Jan 4, 2018, at 11:55 AM, Shane Johnson <sh...@gmail.com>
>> wrote:
>>
>> Hello group, Happy new year! Does anyone have a working example or
>> template using the DataFrame API vs. the RDD based APIs. We are wanting to
>> migrate to using the new DataFrame APIs to take advantage of the *Feature
>> Importance* function for our Regression Random Forest Models.
>>
>> We are wanting to move from
>>
>> import org.apache.spark.mllib.tree.RandomForest
>> import org.apache.spark.mllib.tree.model.RandomForestModel
>> import org.apache.spark.mllib.util.MLUtils
>>
>> to
>>
>> import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
>>
>>
>> Is this something that should be fairly straightforward by adjusting
>> parameters and calling new classes within DASE, or is it much more involved
>> development?
>>
>> Thank You!
>>
>> *Shane Johnson | 801.360.3350 <(801)%20360-3350>*
>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>> <https://www.facebook.com/shane.johnson.71653>
>>
>>
>>

Re: Using Dataframe API vs. RDD API?

Posted by Daniel O' Shaughnessy <da...@gmail.com>.
Hi Shane,

I've successfully used :

import org.apache.spark.ml.classification.{ RandomForestClassificationModel,
RandomForestClassifier }

with pio. You can access feature importance through the
RandomForestClassifier also.

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
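To tie this back to Shane's regression use case, a self-contained sketch with the DataFrame-based `ml` API might look like the following. The toy data and column names are invented for illustration; the key differences from `mllib` are that the estimator trains on a DataFrame and expects the features assembled into a single vector column:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("rf-feature-importance")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy data, invented for illustration: (label, f1, f2).
val df = Seq((1.0, 0.1, 2.0), (0.0, 0.5, 1.0), (1.0, 0.2, 3.0), (0.0, 0.6, 0.5))
  .toDF("label", "f1", "f2")

// ml (unlike mllib) wants all features packed into one vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
  .transform(df)

val model: RandomForestRegressionModel = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(10)
  .fit(assembled)

// One importance weight per input column, in VectorAssembler order.
println(model.featureImportances)
```

The same shape works for the classifier variant; only the estimator and model classes change.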



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <pa...@occamsmachete.com> wrote:

> Actually there are libs that will read DFs from HBase
> https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
>
> This is out of band with PIO and should not be used IMO because the schema
> of the EventStore is not guaranteed to remain as-is. The safest way is to
> translate or get DFs integrated to PIO. I think there is an existing Jira
> that requests Spark ML support, which assumes DFs.
>
>
> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> Funny you should ask this. Yes, we are working on a DF based Universal
> Recommender but you have to convert the RDD into a DF since PIO does not
> read out data in the form of a DF (yet). This is a fairly simple step of
> maybe one line of code but would be better supported in PIO itself. The
> issue is that the EventStore uses libs that may not read out DFs, but RDDs.
> This is certainly the case with Elasticsearch, which provides an RDD lib. I
> haven’t seen one from them that reads out DFs, though it would make a lot of
> sense for ES especially.
>
> So TLDR; yes, just convert the RDD into a DF for now.
>
> Also please add a feature request as a PIO Jira ticket to look into this.
> I for one would +1
>
>
> On Jan 4, 2018, at 11:55 AM, Shane Johnson <sh...@gmail.com>
> wrote:
>
> Hello group, Happy new year! Does anyone have a working example or
> template using the DataFrame API vs. the RDD based APIs. We are wanting to
> migrate to using the new DataFrame APIs to take advantage of the *Feature
> Importance* function for our Regression Random Forest Models.
>
> We are wanting to move from
>
> import org.apache.spark.mllib.tree.RandomForest
> import org.apache.spark.mllib.tree.model.RandomForestModel
> import org.apache.spark.mllib.util.MLUtils
>
> to
>
> import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
>
>
> Is this something that should be fairly straightforward by adjusting
> parameters and calling new classes within DASE, or is it much more involved
> development?
>
> Thank You!
>
> *Shane Johnson | 801.360.3350 <(801)%20360-3350>*
> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
> <https://www.facebook.com/shane.johnson.71653>
>
>
>

Re: Using Dataframe API vs. RDD API?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Actually there are libs that will read DFs from HBase https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html <https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html>

This is out of band with PIO and should not be used IMO because the schema of the EventStore is not guaranteed to remain as-is. The safest way is to translate or get DFs integrated to PIO. I think there is an existing Jira that requests Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal Recommender but you have to convert the RDD into a DF since PIO does not read out data in the form of a DF (yet). This is a fairly simple step of maybe one line of code but would be better supported in PIO itself. The issue is that the EventStore uses libs that may not read out DFs, but RDDs. This is certainly the case with Elasticsearch, which provides an RDD lib. I haven’t seen one from them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohnson@gmail.com <ma...@gmail.com>> wrote:

Hello group, Happy new year! Does anyone have a working example or template using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to using the new DataFrame APIs to take advantage of the Feature Importance function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters and calling new classes within DASE, or is it much more involved development?

Thank You!
Shane Johnson | 801.360.3350

LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook <https://www.facebook.com/shane.johnson.71653>


Re: Using Dataframe API vs. RDD API?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Funny you should ask this. Yes, we are working on a DF based Universal Recommender but you have to convert the RDD into a DF since PIO does not read out data in the form of a DF (yet). This is a fairly simple step of maybe one line of code but would be better supported in PIO itself. The issue is that the EventStore uses libs that may not read out DFs, but RDDs. This is certainly the case with Elasticsearch, which provides an RDD lib. I haven’t seen one from them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.
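As a sketch of that one-line conversion (the RDD contents and column names here are illustrative, standing in for whatever the PIO API returns):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("rdd-to-df")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// An RDD of tuples standing in for data read through the PIO API.
val rdd = spark.sparkContext.parallelize(Seq(("a", 1.0), ("b", 2.0)))

// One line: name the columns and you have a DataFrame.
val df = rdd.toDF("user", "score")
df.printSchema()
```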

Also please add a feature request as a PIO Jira ticket to look into this. I for one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <sh...@gmail.com> wrote:

Hello group, Happy new year! Does anyone have a working example or template using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to using the new DataFrame APIs to take advantage of the Feature Importance function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters and calling new classes within DASE, or is it much more involved development?

Thank You!
Shane Johnson | 801.360.3350

LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook <https://www.facebook.com/shane.johnson.71653>