Posted to user@predictionio.apache.org by Hasan Can Saral <ha...@gmail.com> on 2016/09/22 11:34:04 UTC

How to access Spark Context in predict?

Hi!

I am trying to query the Event Server with the PEventStore API in my predict
method, to fetch events per entity to create my features. PEventStore needs sc,
and for this, I have:

- Extended PAlgorithm
- Extended LocalFileSystemPersistentModel and
LocalFileSystemPersistentModelLoader
- Put a dummy emptyRDD into my model
- Tried to access sc with model.dummyRDD.context, only to receive this error:

org.apache.spark.SparkException: RDD transformations and actions can only
be invoked by the driver, not inside of other transformations; for example,
rdd1.map(x => rdd2.values.count() * x) is invalid because the values
transformation and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.

Just like this user got it here
<https://groups.google.com/forum/#!topic/predictionio-user/h4kIltGIIYE> in
predictionio-user group. Any suggestions?

Here's more of my predict method:

def predict(model: SomeModel, query: Query): PredictedResult = {

  val appName = sys.env.getOrElse[String]("APP_NAME", ap.appName)

  var previousEvents = try {
    PEventStore.find(
      appName = appName,
      entityType = Some(ap.entityType),
      entityId = Some(query.entityId.getOrElse(""))
    )(model.dummyRDD.context).map(event => {
      Try(new CustomEvent(
        Some(event.event),
        Some(event.entityType),
        Some(event.entityId),
        Some(event.eventTime),
        Some(event.creationTime),
        Some(new Properties(
          ...
        ))
      ))
    }).filter(_.isSuccess).map(_.get)
  } catch {
    case e: Exception => // fatal: rethrow rather than serve an empty result
      logger.error(s"Error when reading events: ${e}")
      throw e
  }

  ...

}

Re: How to access Spark Context in predict?

Posted by Marcin Ziemiński <zi...@gmail.com>.
Hi Hasan,

Regarding the third point: injecting a dummy RDD may cause the original
errors once it is serialized, as you described at the beginning. If you use
PAlgorithm, take a look at PersistentModelLoader. Implementing
PersistentModel together with PersistentModelLoader should give you
access to the SparkContext. As you can see, PersistentModelLoader is provided
with an Option[SparkContext] in its apply method, which in this case should be
Some(sc).
Your model can be something that encapsulates both the RandomForestModel and
the SparkContext. With this you would save only the RandomForestModel, but
during model loading you can put the given SparkContext inside your
model, where you can later get it in the predict method.

class MyModel(val rfm: RandomForestModel, sc: SparkContext) extends PersistentModel[MyParams] {
  def save(id: String, params: MyParams, sc: SparkContext): Boolean = {
    // Here you save rfm
  }
}

object MyModel extends PersistentModelLoader[MyParams, MyModel] {
  def apply(id: String, params: MyParams, sc: Option[SparkContext]): MyModel = {
    // use the provided sc to recreate your model and use it later in the predict method
  }
}
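
For illustration, a minimal sketch of what those two placeholders could look
like. It assumes the io.prediction package names of this PredictionIO version,
the MyParams class as above, and an arbitrary /tmp path scheme keyed by the
engine instance id; MLlib's RandomForestModel does provide save(sc, path) and
RandomForestModel.load(sc, path), which avoids the saveAsObjectFile route:

import io.prediction.controller.{PersistentModel, PersistentModelLoader}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel

class MyModel(val rfm: RandomForestModel, @transient val sc: SparkContext)
  extends PersistentModel[MyParams] {

  def save(id: String, params: MyParams, sc: SparkContext): Boolean = {
    // Persist only the forest; a SparkContext cannot be serialized.
    rfm.save(sc, s"/tmp/$id/forest") // hypothetical path scheme
    true
  }
}

object MyModel extends PersistentModelLoader[MyParams, MyModel] {
  def apply(id: String, params: MyParams, sc: Option[SparkContext]): MyModel = {
    // On deploy, PredictionIO passes Some(sc); rebuild the model around the live context.
    val context = sc.getOrElse(sys.error("SparkContext required to load MyModel"))
    new MyModel(RandomForestModel.load(context, s"/tmp/$id/forest"), context)
  }
}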

I hope this helps.

Regards,
Marcin


On Tue, Sep 27, 2016 at 13:04, Hasan Can Saral <ha...@gmail.com>
wrote:

> Hi Kenneth & Donald,
>
> That was really clarifying, thank you. I really appreciate it. So now I
> know that:
>
> 1- I should use LEventStore and query HBase without sc, and with as
> little processing as possible in predict,
> 2- In this case I don't have to extend PersistentModel, since I will not
> need sc in predict.
> 3- If I need sc and batch processing in predict, I can save the RandomForest
> trees to a file, then load them from there. As far as I can see, my
> only option to access sc for PEventStore is to add a dummy RDD to the
> model, and use dummyRDD.context.
>
> Am I correct, especially in the 3rd point?
>
> Thank you again,
> Hasan

Re: How to access Spark Context in predict?

Posted by Hasan Can Saral <ha...@gmail.com>.
Hi Kenneth & Donald,

That was really clarifying, thank you. I really appreciate it. So now I
know that:

1- I should use LEventStore and query HBase without sc, and with as
little processing as possible in predict,
2- In this case I don't have to extend PersistentModel, since I will not
need sc in predict.
3- If I need sc and batch processing in predict, I can save the RandomForest
trees to a file, then load them from there. As far as I can see, my
only option to access sc for PEventStore is to add a dummy RDD to the
model, and use dummyRDD.context.

Am I correct, especially in the 3rd point?

Thank you again,
Hasan



-- 

Hasan Can Saral
hasancansaral@gmail.com

Re: How to access Spark Context in predict?

Posted by Kenneth Chan <ke...@apache.org>.
Hasan,

Spark's random forest algorithm doesn't need an RDD at predict time; it is much
simpler to serialize the model and use it in local memory in predict().
See the example here:
https://github.com/PredictionIO/template-scala-parallel-leadscoring/blob/develop/src/main/scala/RFAlgorithm.scala

For accessing the event store in predict(), you should use the LEventStore API
(not the PEventStore API) to get fast queries for specific events.

(use the PEventStore API only if you really want to do batch processing again
in predict() and need an RDD for it)
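
As a rough illustration of that advice, a predict() along these lines queries
recent events for one entity without any SparkContext. LEventStore.findByEntity
is the real entry point; SomeModel, Query, ap and the feature-building step are
assumptions carried over from earlier in the thread:

import io.prediction.data.store.LEventStore
import scala.concurrent.duration._

def predict(model: SomeModel, query: Query): PredictedResult = {
  // Synchronous, bounded lookup of this entity's events; no SparkContext involved.
  val previousEvents = LEventStore.findByEntity(
    appName = sys.env.getOrElse("APP_NAME", ap.appName),
    entityType = ap.entityType,
    entityId = query.entityId.getOrElse(""),
    limit = Some(500),   // keep the serving-path scan small
    timeout = 2.seconds
  ).toList

  // ... build features from previousEvents and score with the in-memory forest ...
  ???
}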


Kenneth



Re: How to access Spark Context in predict?

Posted by Donald Szeto <do...@apache.org>.
Hi Hasan,

Does your randomForestModel contain any RDD?

If so, implement your algorithm by extending PAlgorithm, have your model
extend PersistentModel, and implement PersistentModelLoader to save and
load your model. You will be able to perform RDD operations within
predict() by using the model's RDD.

If not, implement your algorithm by extending P2LAlgorithm, and see if
PredictionIO can automatically persist the model for you. The convention
assumes that a non-RDD model does not require Spark to perform any RDD
operations, so there will be no SparkContext access.
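
For the non-RDD route, a skeleton along these lines shows what that convention
looks like in practice. It is a sketch assuming the engine's own types
(SomeAlgorithmParams, PreparedData, Query, PredictedResult), a labeledPoints
field on PreparedData, and a hypothetical toFeatureVector helper:

import io.prediction.controller.P2LAlgorithm
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

// The model type is a plain RandomForestModel (no RDD inside), so PredictionIO
// can serialize and restore it automatically; predict() then needs no SparkContext.
class RFAlgorithm(val ap: SomeAlgorithmParams)
  extends P2LAlgorithm[PreparedData, RandomForestModel, Query, PredictedResult] {

  def train(sc: SparkContext, data: PreparedData): RandomForestModel =
    RandomForest.trainClassifier(
      data.labeledPoints,              // assumed RDD[LabeledPoint] on PreparedData
      numClasses = 2,
      categoricalFeaturesInfo = Map.empty[Int, Int],
      numTrees = 10,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32)

  def predict(model: RandomForestModel, query: Query): PredictedResult =
    // toFeatureVector is a hypothetical helper turning a Query into an mllib Vector.
    PredictedResult(model.predict(toFeatureVector(query)))
}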

Are these conventions not fitting your use case? Feedback is always
welcome for improving PredictionIO.

Regards,
Donald



Re: How to access Spark Context in predict?

Posted by Hasan Can Saral <ha...@gmail.com>.
Hi Marcin,

I did look at the definition of PersistentModel, and indeed replaced
LocalFileSystemPersistentModel with PersistentModel. Thank you for this, I
really appreciate your help.

However, I am having quite a hard time understanding how I can access the sc
object that PredictionIO provides to the save and apply methods from within
the predict method.

class SomeModel(randomForestModel: RandomForestModel, dummyRDD: RDD)
  extends PersistentModel[SomeAlgorithmParams] {

  override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
    // Here I should save randomForestModel to a file, but how to?
    // Tried saveAsObjectFile but no luck.
    true
  }
}

object SomeModel extends PersistentModelLoader[SomeAlgorithmParams, FraudModel] {
  override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
    // Here should I load randomForestModel from file? How?
    new SomeModel(randomForestModel)
  }
}

So, my questions have become:
1- Can I save randomForestModel? If yes, how? If I cannot, I will have to
return false and retrain upon deployment. How do I skip pio train in this
case?
2- How do I load the saved randomForestModel from file? If I cannot, will I
remove the object SomeModel extends PersistentModelLoader altogether?
3- How do I access sc within predict? Do I save a dummy RDD, load it in
apply, and say .context? In this case what happens to randomForestModel?

I am really quite confused and could really appreciate some help/sample
code if you have time.
Thank you.
Hasan




-- 

Hasan Can Saral
hasancansaral@gmail.com

Re: How to access Spark Context in predict?

Posted by Marcin Ziemiński <zi...@gmail.com>.
Hi Hasan,

So I guess there are two things here:
1. You need a SparkContext for predictions
2. You also need to retrain your model during loading

Please, look at the definition of PersistentModel and the comments attached:

trait PersistentModel[AP <: Params] {
  /** Save the model to some persistent storage.
    *
    * This method should return true if the model has been saved successfully so
    * that PredictionIO knows that it can be restored later during deployment.
    * This method should return false if the model cannot be saved (or should
    * not be saved due to configuration) so that PredictionIO will re-train the
    * model during deployment. All arguments of this method are provided
    * automatically by PredictionIO.
    *
    * @param id ID of the run that trained this model.
    * @param params Algorithm parameters that were used to train this model.
    * @param sc An Apache Spark context.
    */
  def save(id: String, params: AP, sc: SparkContext): Boolean
}

In order to achieve the desired result you could simply use PersistentModel
instead of LocalFileSystemPersistentModel and return false from save. Then,
during deployment, your model will be retrained through your Algorithm
implementation. You shouldn't need to retrain your model in implementations
of PersistentModelLoader - that is rather for loading models that are
already trained and stored somewhere.
You can save the SparkContext instance provided to the train method for use
in predict(...) (assuming that your algorithm is an instance of PAlgorithm
or P2LAlgorithm). Thus you should have what you need.
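
A minimal sketch of that retrain-on-deploy pattern, assuming the SomeModel,
SomeAlgorithmParams and PreparedData types used elsewhere in this thread (the
??? bodies are placeholders, not real implementations):

import io.prediction.controller.{PAlgorithm, PersistentModel}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel

class SomeModel(val rfm: RandomForestModel, @transient val sc: SparkContext)
  extends PersistentModel[SomeAlgorithmParams] {
  // Returning false tells PredictionIO the model is never persisted,
  // so `pio deploy` retrains it and train() runs with a live SparkContext.
  def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = false
}

class SomeAlgorithm(val ap: SomeAlgorithmParams)
  extends PAlgorithm[PreparedData, SomeModel, Query, PredictedResult] {

  def train(sc: SparkContext, data: PreparedData): SomeModel = {
    val rfm: RandomForestModel = ??? // train the forest here
    new SomeModel(rfm, sc)           // keep the live context inside the model
  }

  def predict(model: SomeModel, query: Query): PredictedResult = {
    val sc = model.sc // usable here, e.g. PEventStore.find(...)(sc)
    ???
  }
}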

Regards,
Marcin



Re: How to access Spark Context in predict?

Posted by Hasan Can Saral <ha...@gmail.com>.
Hi Marcin!

Thank you for your answer.

I do only need the SparkContext, but have no idea:
1- How to retrieve it from PersistentModelLoader?
2- How do I access sc in the predict method using the configuration below?

class SomeModel() extends LocalFileSystemPersistentModel[SomeAlgorithmParams] {
  override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
    false
  }
}

object SomeModel extends LocalFileSystemPersistentModelLoader[SomeAlgorithmParams, FraudModel] {
  override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
    new SomeModel() // HERE I TRAIN AND RETURN THE TRAINED MODEL
  }
}

Thank you very much, I really appreciate it!

Hasan




-- 

Hasan Can Saral
hasancansaral@gmail.com

Re: How to access Spark Context in predict?

Posted by Marcin Ziemiński <zi...@gmail.com>.
Hi Hasan,

I think that your problem comes from using a deserialized RDD, which has
already lost its connection to the SparkContext.
A similar case can be found here:
http://stackoverflow.com/questions/29567247/serializing-rdd

If you only really need the SparkContext, you could probably use the one
provided to PersistentModelLoader, which would be implemented by your model.
Alternatively, you could implement PersistentModel to return false from the
save method. In this case your algorithm would be retrained on deploy, which
would also provide you with the SparkContext instance.

Regards,
Marcin

