Posted to issues@spark.apache.org by "Avenash Kabeera (Jira)" <ji...@apache.org> on 2021/05/17 09:45:00 UTC

[jira] [Comment Edited] (SPARK-35370) IllegalArgumentException when loading a PipelineModel with Spark 3

    [ https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342559#comment-17342559 ] 

Avenash Kabeera edited comment on SPARK-35370 at 5/17/21, 9:44 AM:
-------------------------------------------------------------------

Some additional details.  My model was saved in parquet format, and after some research I found that parquet column names are treated as case insensitive.  I confirmed this by trying a workaround: load my model, rename the column to "nodeData", and resave it (as sketched below), but everything I tried ended up saving the model with the column "nodedata".  Given this case insensitivity, doesn't it make more sense for the fix mentioned above, which supports loading Spark 2 models, to check for "nodedata" rather than "nodeData"?
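The rename-and-resave attempt looked roughly like this (a sketch only; the paths are placeholders for my model's layout, and an active SparkSession {{spark}} is assumed):
{code:scala}
// Read the tree data of the saved model, rename the lowercase column, and
// write the result to a new location.
val treeData = spark.read.parquet("/path/to/model/data")

treeData
  .withColumnRenamed("nodedata", "nodeData")
  .write.parquet("/path/to/model-fixed/data")
// In my case the rewritten files still ended up with a "nodedata" column.
{code}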


> IllegalArgumentException when loading a PipelineModel with Spark 3
> ------------------------------------------------------------------
>
>                 Key: SPARK-35370
>                 URL: https://issues.apache.org/jira/browse/SPARK-35370
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.1.0, 3.1.1
>         Environment: spark 3.1.1
>            Reporter: Avenash Kabeera
>            Priority: Minor
>              Labels: V3, decisiontree, scala, treemodels
>
> Hi, 
> This is a followup to https://issues.apache.org/jira/browse/SPARK-33398, which fixed an exception when loading a model in Spark 3 that was trained in Spark 2.  After incorporating that fix in my project, I ran into another issue, introduced by the fix itself: [https://github.com/apache/spark/pull/30889/files]
> While loading my random forest model, which was trained in Spark 2.2, I ran into the following exception:
> {code:java}
> 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: nodeData does not exist. Available: treeid, nodedata
>  at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
>  at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>  at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
>  at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522)
>  at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420)
>  at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>  at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>  at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code}
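> For reference, the exception above comes from a plain load of the saved pipeline. A minimal sketch (the path is a placeholder):
> {code:scala}
> // Load a pipeline model trained in Spark 2 under Spark 3.
> import org.apache.spark.ml.PipelineModel
>
> val model = PipelineModel.load("/path/to/spark2-trained-model")
> {code}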
> When I looked at the data for the model, I saw that the schema uses "*nodedata*" instead of "*nodeData*".  Here is what my model looks like:
> {code:java}
> +------+-----------------------------------------------------------------------------------------------------------------+
> |treeid|nodedata                                                                                                         |
> +------+-----------------------------------------------------------------------------------------------------------------+
> |12    |{0, 1.0, 0.20578590428109744, [249222.0, 1890856.0], 0.046774779237015784, 1, 128, {1, [0.7468856332819338], -1}}|
> |12    |{1, 1.0, 0.49179982674596906, [173902.0, 224985.0], 0.022860340952237657, 2, 65, {4, [0.6627218934911243], -1}}  |
> |12    |{2, 0.0, 0.4912259578159168, [90905.0, 69638.0], 0.10950848921275999, 3, 34, {9, [0.13666873125270484], -1}}     |
> |12    |{3, 1.0, 0.4308078797704775, [23317.0, 50941.0], 0.04311282777881931, 4, 19, {10, [0.506218002482692], -1}}      | {code}
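> (The table above comes from reading the saved tree data directly. A sketch, assuming an active SparkSession {{spark}}; the path is a placeholder, and for a pipeline model the forest's data sits under a stages subdirectory:)
> {code:scala}
> // Inspect the column names of the saved tree data parquet.
> val treeData = spark.read.parquet("/path/to/model/stages/<forest-stage>/data")
> treeData.printSchema()              // shows treeid / nodedata in my case
> treeData.show(4, truncate = false)  // the rows listed above
> {code}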
> I'm new to Spark, and the training of this model predates me, so I can't say whether the column name "nodedata" came from our own code or from Spark internals.  But I suspect it's internal Spark code.
>  
> edit:
> cc [~podongfeng], the author of the original PR to support loading Spark 2 models in Spark 3.  Maybe you have some insights on "nodedata" vs "nodeData".
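> A case-insensitive lookup on the load path would cover both spellings. A minimal sketch of what I mean (only the field names come from the stack trace above; the helper itself is hypothetical):
> {code:scala}
> import org.apache.spark.sql.DataFrame
>
> // Hypothetical helper: resolve the node-data column without assuming its case.
> def findNodeDataCol(df: DataFrame): String =
>   df.schema.fieldNames
>     .find(_.equalsIgnoreCase("nodeData"))
>     .getOrElse(throw new IllegalArgumentException(
>       s"nodeData does not exist. Available: ${df.schema.fieldNames.mkString(", ")}"))
> {code}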


