Posted to user@spark.apache.org by Jianhong Xia <jx...@infoblox.com> on 2017/02/15 00:46:18 UTC

RE: Can't load a RandomForestClassificationModel in Spark job

Is there any update on this problem?

I encountered the same issue that was mentioned here.

I have CrossValidatorModel.transform(df) running on workers, which requires a DataFrame as input. However, we only have Arrays on the workers, and when we deploy our model in cluster mode we cannot call createDataFrame on the workers. It gives me this error:


17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
java.lang.NullPointerException
     at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
     at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
     at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
     at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)



On the other hand, if we run in local mode, everything works fine.

I just want to know whether there is any successful case of running machine learning models on the workers.
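
For illustration, here is a minimal Scala sketch of the pattern that is
usually suggested for this: build the DataFrame on the driver, where the
SparkSession is valid, and run the CrossValidatorModel there, so the workers
only execute plain RDD operations. The model path, feature values, and column
name below are all hypothetical.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.sql.SparkSession

object DriverSideScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-side-scoring").getOrCreate()

    // Stand-in for the Array[Double] records produced on the workers.
    val featureArrays = spark.sparkContext.parallelize(Seq(
      Array(0.1, 2.0, 3.5),
      Array(1.2, 0.0, 7.1)
    ))

    // createDataFrame runs on the driver; only the map over the RDD executes
    // on the workers, so no SparkSession is needed there.
    val df = spark.createDataFrame(
      featureArrays.map(a => Tuple1(Vectors.dense(a)))
    ).toDF("features")

    // Load on the driver, then transform (hypothetical path).
    val cvModel = CrossValidatorModel.load("hdfs:///models/my-cv-model")
    cvModel.transform(df).show()
  }
}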


Thanks,
Jianhong


From: Sumona Routh [mailto:sumosha2@gmail.com]
Sent: Thursday, January 12, 2017 6:20 PM
To: ayan guha <gu...@gmail.com>; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job

Yes, I save it to S3 in a different process. It is actually the RandomForestClassificationModel.load method (passed an S3 path) where I run into problems.
When you say you load it during map stages, do you mean that you are able to load a model directly from inside a transformation? When I try this, the function is shipped to a worker, and the load method itself appears to attempt to create a new SparkContext, which causes an NPE downstream (because creating a SparkContext on a worker is not an appropriate thing to do, according to various threads I've found).
Maybe there is a different load function I should be using?
Thanks!
Sumona

On Thu, Jan 12, 2017 at 6:26 PM ayan guha <gu...@gmail.com> wrote:
Hi

Given that training and prediction are two different applications, I typically save model objects to HDFS and load them back during the prediction map stages.
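
A minimal sketch of that split, with a hypothetical HDFS path and toy data;
the training half and the prediction half would normally live in two separate
applications:

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SaveThenLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-then-load").getOrCreate()

    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.1, 0.1)),
      (1.0, Vectors.dense(2.0, 1.0, -1.0))
    )).toDF("label", "features")

    // Training application: fit and persist the model.
    val model = new RandomForestClassifier().fit(training)
    model.write.overwrite().save("hdfs:///models/rf-model")

    // Prediction application: load on the driver, then transform as usual.
    val loaded = RandomForestClassificationModel.load("hdfs:///models/rf-model")
    loaded.transform(training).show()
  }
}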

Best
Ayan

On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <su...@gmail.com> wrote:
Hi all,
I've been working with the Spark MLlib 2.0.2 RandomForestClassificationModel.
I encountered two frustrating issues and would really appreciate some advice:

1) RandomForestClassificationModel is effectively not serializable (I assume it references something that can't be serialized, since the class itself extends Serializable), so I ended up with the well-known exception: org.apache.spark.SparkException: Task not serializable.
Basically, my original intention was to pass the model as a parameter, because which model we use is dynamic, depending on which record we are predicting on.
Has anyone else encountered this? Is this currently being addressed? I would expect objects from Spark's own libraries to be usable seamlessly in applications without these kinds of exceptions.
2) The RandomForestClassificationModel.load method appears to hang indefinitely when executed from inside a map function (which I assume is shipped to an executor), so I basically cannot load a model from a worker. We have multiple "profiles" that use differently trained models, and these models are accessed from within a map function to run predictions on different sets of data.
The thread that is hanging has this as the latest (most pertinent) code:
org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
Looking at the code on GitHub, it appears that loadMetadata calls sc.textFile. I could not find anything stating that this particular function would not work from within a map function.
Are there any suggestions as to how I can get this model to work in a real production job (either by making it serializable so it can be passed around, or by loading it from a worker)?
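
A note on the hang: loadMetadata ends up calling sc.textFile, and a
SparkContext is only valid on the driver, so a load from inside a map
function touches a context that does not exist on the executor. One commonly
suggested workaround, sketched below under hypothetical names (the profile
list, the S3 paths, and the "profile" column are all assumptions): load every
profile's model once on the driver, keep them in a map, and express the
per-profile dispatch as DataFrame filters instead of per-record loads.

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.sql.DataFrame

val profiles = Seq("profileA", "profileB")

// Load every model once, on the driver, where loading is known to work.
val models: Map[String, RandomForestClassificationModel] =
  profiles.map { p =>
    p -> RandomForestClassificationModel.load(s"s3a://my-bucket/models/$p")
  }.toMap

// Dispatch per profile with filters; the union keeps a single output DataFrame.
def scoreAll(input: DataFrame): DataFrame =
  profiles
    .map(p => models(p).transform(input.filter(input("profile") === p)))
    .reduce(_ union _)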
I've extensively POCed this model (saving, loading, transforming, training, etc.); however, this is the first time I'm attempting to use it in a real application.
Sumona

Re: Can't load a RandomForestClassificationModel in Spark job

Posted by Russell Jurney <ru...@gmail.com>.
When you say workers, are you using Spark Streaming? I'm not sure if this
will help, but there is an example of deploying a
RandomForestClassificationModel in Spark Streaming against Kafka that uses
createDataFrame here:
https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch08/make_predictions_streaming.py


I had to create a pyspark.sql.Row in a map operation on an RDD before
calling spark.createDataFrame. Check out lines 92-138.
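
For the Scala side, a rough analogue of that shape (record fields
hypothetical): map each record to a Row on the workers, then build the
DataFrame on the driver with an explicit schema.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rows-then-df").getOrCreate()

// Hypothetical records; the map to Row runs on the workers.
val records = spark.sparkContext.parallelize(Seq(("a", 1.0), ("b", 2.0)))
val rows = records.map { case (id, score) => Row(id, score) }

// createDataFrame itself runs on the driver, with an explicit schema.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("score", DoubleType, nullable = false)
))
val df = spark.createDataFrame(rows, schema)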

Not sure if this helps, but I thought I'd give it a try ;)

---
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com

RE: Can't load a RandomForestClassificationModel in Spark job

Posted by Jianhong Xia <jx...@infoblox.com>.
Thanks Hollin.

I will take a look at MLeap and will let you know if I have any questions.

Jianhong


Re: Can't load a RandomForestClassificationModel in Spark job

Posted by Hollin Wilkins <ho...@combust.ml>.
Hey there,

Creating a new SparkContext on workers will not work; only the driver is
allowed to own a SparkContext. Are you trying to distribute your model to
workers so you can create a distributed scoring service? If so, it may be
worth looking into taking your models outside of a SparkContext and serving
them separately.

If this is your use case, take a look at MLeap. We use it in production to
serve high-volume, real-time requests from Spark-trained models:
https://github.com/combust/mleap

Cheers,
Hollin
