You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nicholas Brett Marcott (Jira)" <ji...@apache.org> on 2020/12/21 12:57:00 UTC
[jira] [Comment Edited] (SPARK-33398) AnalysisException when
loading a PipelineModel with Spark 3
[ https://issues.apache.org/jira/browse/SPARK-33398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252831#comment-17252831 ]
Nicholas Brett Marcott edited comment on SPARK-33398 at 12/21/20, 12:56 PM:
----------------------------------------------------------------------------
It appears rawCount is a new field added by [this PR|https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47] , but there doesn't seem to be logic to handle the missing column/old data.
+[~imatiach]+ [~podongfeng]
Is there some type of default value or missing column handling that should be added here?
I do not see a mention of this in the [ml lib migration guide.|https://spark.apache.org/docs/latest/ml-migration-guide.html#upgrading-from-mllib-24-to-30]
I was also able to reproduce this with the following code, saving using v2.4.7 and loading using v3.0.1
{code:java}
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier, DecisionTreeClassificationModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
# Prepare training documents from a list of (id, text, label) tuples.
df = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=1000)
treeClassifier = DecisionTreeClassifier()
pipeline = Pipeline(stages=[tokenizer, hashingTF])
features = pipeline.fit(df).transform(df)
model = treeClassifier.fit(features)
model.write().overwrite().save("/tmp/dc")
#loading code
from pyspark.ml.classification import DecisionTreeClassificationModel
DecisionTreeClassificationModel.read().load("/tmp/dc")
{code}
was (Author: nmarcott):
It appears rawCount is a new field added by [this PR|https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47] , but there doesn't seem to be logic to handle the missing column/old data.
+[~imatiach] +[~podongfeng]
Is there some type of default value or missing column handling that should be added here?
I was also able to reproduce this with the following code, saving using v2.4.7 and loading using v3.0.1
{code:java}
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier, DecisionTreeClassificationModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
# Prepare training documents from a list of (id, text, label) tuples.
df = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=1000)
treeClassifier = DecisionTreeClassifier()
pipeline = Pipeline(stages=[tokenizer, hashingTF])
features = pipeline.fit(df).transform(df)
model = treeClassifier.fit(features)
model.write().overwrite().save("/tmp/dc")
#loading code
from pyspark.ml.classification import DecisionTreeClassificationModel
DecisionTreeClassificationModel.read().load("/tmp/dc")
{code}
> AnalysisException when loading a PipelineModel with Spark 3
> -----------------------------------------------------------
>
> Key: SPARK-33398
> URL: https://issues.apache.org/jira/browse/SPARK-33398
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 3.0.1
> Environment: - Databricks runtime 7.3 ML
> - Spark 3.0.1
> - Python 3.7
> Reporter: LoicH
> Priority: Major
> Labels: V3, decisiontree, pyspark
>
> I am upgrading my Spark version from 2.4.5 to 3.0.1 and I cannot load anymore the PipelineModel objects that use a "DecisionTreeClassifier" stage.
> In my code I load several PipelineModel, all the PipelineModel with stages ["CountVectorizer_[uid]", "LinearSVC_[uid]"] are loading fine whereas the models with stages
> ["CountVectorizer_[uid]","DecisionTreeClassifier_[uid]"] are throwing the following exception:
> {noformat}
> AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];{noformat}
> Here is the code I am using and the full stacktrace:
> {code:python}
> from pyspark.ml.pipeline import PipelineModel
> PipelineModel.load("/path/to/model")
> {code}
> {noformat}
> AnalysisException Traceback (most recent call last)
> <command-1278858167154148> in <module>
> ----> 1 RalentModel = PipelineModel.load(MODELES_ATTRIBUTS + "RalentModel_DT")/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
> 368 def load(cls, path):
> 369 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
> --> 370 return cls.read().load(path)
> 371
> 372 /databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
> 289 metadata = DefaultParamsReader.loadMetadata(path, self.sc)
> 290 if 'language' not in metadata['paramMap'] or metadata['paramMap']['language'] != 'Python':
> --> 291 return JavaMLReader(self.cls).load(path)
> 292 else:
> 293 uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)/databricks/spark/python/pyspark/ml/util.py in load(self, path)
> 318 if not isinstance(path, basestring):
> 319 raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 320 java_obj = self._jread.load(path)
> 321 if not hasattr(self._clazz, "_from_java"):
> 322 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
> 1303 answer = self.gateway_client.send_command(command)
> 1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
> 1306
> 1307 for temp_arg in temp_args:/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
> 131 # Hide where the exception came from that shows a non-Pythonic
> 132 # JVM exception message.
> --> 133 raise_from(converted)
> 134 else:
> 135 raise/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)
> AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
> {noformat}
> These pipeline models where saved using Spark 2.4.3, I can load them fine using Spark 2.4.5.
> I tried to investigate further and load each stage separately. Loading the CountVectorizerModel with
> {code:python}
> from pyspark.ml.feature import CountVectorizerModel
> CountVectorizerModel.read().load("/path/to/model/stages/0_CountVectorizer_efce893314a9")
> {code}
> yields a CountVectorizerModel, but my code fails when trying to load the DecisionTreeClassificationModel:
> {code:python}
> DecisionTreeClassificationModel.read().load("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0")
> AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
> {code}
> And here is the content of the "data" of my Decision Tree Classifier:
> {code:python}
> spark.read.parquet("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data").show()
> +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
> | id|prediction| impurity|impurityStats| gain|leftChild|rightChild| split|
> +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
> | 0| 0.0| 0.3926234384295062| [90.0, 33.0]| 0.16011830963990054| 1| 16|[190, [0.5], -1]|
> | 1| 0.0| 0.2672722508516028| [90.0, 17.0]| 0.11434106988303855| 2| 15|[512, [0.5], -1]|
> | 2| 0.0| 0.1652892561983472| [90.0, 9.0]| 0.06959547629404085| 3| 14|[583, [0.5], -1]|
> | 3| 0.0| 0.09972299168975082| [90.0, 5.0]|0.026984966852376356| 4| 11|[480, [0.5], -1]|
> | 4| 0.0|0.043933846736523306| [87.0, 2.0]|0.021717299239076976| 5| 10|[555, [1.5], -1]|
> | 5| 0.0|0.022469008264462766| [87.0, 1.0]|0.011105371900826402| 6| 7|[833, [0.5], -1]|
> | 6| 0.0| 0.0| [86.0, 0.0]| -1.0| -1| -1| [-1, [], -1]|
> | 7| 0.0| 0.5| [1.0, 1.0]| 0.5| 8| 9| [0, [0.5], -1]|
> | 8| 0.0| 0.0| [1.0, 0.0]| -1.0| -1| -1| [-1, [], -1]|
> | 9| 1.0| 0.0| [0.0, 1.0]| -1.0| -1| -1| [-1, [], -1]|
> | 10| 1.0| 0.0| [0.0, 1.0]| -1.0| -1| -1| [-1, [], -1]|
> | 11| 0.0| 0.5| [3.0, 3.0]| 0.5| 12| 13| [14, [1.5], -1]|
> | 12| 0.0| 0.0| [3.0, 0.0]| -1.0| -1| -1| [-1, [], -1]|
> | 13| 1.0| 0.0| [0.0, 3.0]| -1.0| -1| -1| [-1, [], -1]|
> | 14| 1.0| 0.0| [0.0, 4.0]| -1.0| -1| -1| [-1, [], -1]|
> | 15| 1.0| 0.0| [0.0, 8.0]| -1.0| -1| -1| [-1, [], -1]|
> | 16| 1.0| 0.0| [0.0, 16.0]| -1.0| -1| -1| [-1, [], -1]|
> +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org