Posted to user@spark.apache.org by Eugene Morozov <ev...@gmail.com> on 2016/02/12 15:57:52 UTC

[SparkML] RandomForestModel save on disk.

Hello,

I'm building a simple web service that works with Spark and lets users
train random forest models (MLlib API) and use them for prediction. Trained
models are stored on the local file system (the web service and a
single-worker Spark instance run on the same machine).
I'm concerned about prediction performance and have set up a small load
test to measure prediction latency. That's just for a start; later I will
set up HDFS and a bigger Spark cluster.

First I train 5 really small models (all of them finish within 30
seconds).
Then my perf-testing framework waits for a minute and starts calling the
prediction method.

Sometimes I see that not all of the 5 models were saved to disk. There is a
metadata folder for them, but not the data directory that actually contains
the models' Parquet files.
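As a workaround, the load test could verify that each save actually
completed instead of relying on a fixed one-minute wait. A minimal sketch
in plain Python, assuming the MLlib 1.x on-disk layout (a `metadata` and a
`data` subdirectory under the model path); the function name is my own,
not a Spark API:

```python
import os

def model_save_complete(model_path):
    """Return True only when a saved MLlib tree model looks complete.

    An MLlib 1.x tree-ensemble save writes a `metadata` directory and a
    `data` directory (holding the Parquet files) under the model path;
    an interrupted save can leave `metadata` without `data`.
    """
    return (os.path.isdir(os.path.join(model_path, "metadata"))
            and os.path.isdir(os.path.join(model_path, "data")))
```

The perf-testing framework could poll this check (with a timeout) before
issuing predictions, rather than sleeping for a fixed minute.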

I've looked through Spark's JIRA, but haven't found anything similar.
Has anyone experienced something like this?
Could you recommend where to look?
Might it be something to do with flushing to disk immediately (just a wild
idea...)?

Thanks in advance.
--
Be well!
Jean Morozov

Re: [SparkML] RandomForestModel save on disk.

Posted by Eugene Morozov <ev...@gmail.com>.
Here is the exception I discovered.

java.lang.RuntimeException: error reading Scala signature of org.apache.spark.mllib.tree.model.DecisionTreeModel: scala.reflect.internal.Symbols$PackageClassSymbol cannot be cast to scala.reflect.internal.Constants$Constant
        at scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:45) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.runtime.JavaMirrors$JavaMirror.unpickleClass(JavaMirrors.scala:565) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.runtime.SymbolLoaders$TopClassCompleter.complete(SymbolLoaders.scala:32) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:43) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) ~[scala-reflect-2.10.4.jar:na]
        at org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$$typecreator1$1.apply(treeEnsembleModels.scala:450) ~[spark-mllib_2.10-1.6.0.jar:1.6.0]
        at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) ~[scala-reflect-2.10.4.jar:na]
        at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) ~[scala-reflect-2.10.4.jar:na]
        at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:642) ~[spark-catalyst_2.10-1.6.0.jar:1.6.0]


--
Be well!
Jean Morozov
