Posted to issues@spark.apache.org by "Weichen Xu (Jira)" <ji...@apache.org> on 2022/08/19 04:29:00 UTC
[jira] [Resolved] (SPARK-35542) Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[ https://issues.apache.org/jira/browse/SPARK-35542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu resolved SPARK-35542.
--------------------------------
Fix Version/s: 3.3.1
3.1.4
3.2.3
3.4.0
Resolution: Fixed
Issue resolved by pull request 37568
[https://github.com/apache/spark/pull/37568]
> Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-35542
> URL: https://issues.apache.org/jira/browse/SPARK-35542
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.1
> Environment: Databricks Spark 3.1.1
> Reporter: Srikanth Pusarla
> Assignee: Weichen Xu
> Priority: Minor
> Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
> Attachments: Code-error.PNG, traceback.png
>
>
> A Bucketizer created for multiple columns with the parameters splitsArray, inputCols and outputCols cannot be loaded after saving it.
> The problem is not seen for a Bucketizer created for a single column.
> *Code to reproduce*
> ###################################
> from pyspark.ml.feature import Bucketizer
> df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
> bucketizer = Bucketizer(splitsArray=[[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.1, 1.2, float("inf")]], inputCols=["values", "values"], outputCols=["b1", "b2"])
> bucketed = bucketizer.transform(df).collect()
> dfb = bucketizer.transform(df)
> dfb.show()
> bucketizerPath = "dbfs:/mnt/S3-Bucket/" + "Bucketizer"
> bucketizer.write().overwrite().save(bucketizerPath)
> loadedBucketizer = Bucketizer.load(bucketizerPath)  #### Failing here
> loadedBucketizer.getSplits() == bucketizer.getSplits()
> ############################################################
> The error message is
> TypeError: array() argument 1 must be a unicode character, not bytes
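> For context, this TypeError can be reproduced with nothing but the standard library: in Python 3, array.array() requires a str typecode, and this is exactly the error raised when the unpickler hands it a bytes typecode instead. A minimal sketch (plain Python, not Spark code, illustrating the error message only):

```python
# Minimal sketch of the underlying TypeError: array.array() in
# Python 3 accepts only a str typecode, not bytes.
import array

arr = array.array("d", [0.5, 1.4])      # ok: typecode is str
print(list(arr))

try:
    array.array(b"d", [0.5, 1.4])       # typecode is bytes -> fails
except TypeError as e:
    # array() argument 1 must be a unicode character, not bytes
    print(e)
```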
>
> *BackTrace:*
>
> --------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> <command-3999490> in <module>
>      15
>      16 bucketizer.write().overwrite().save(bucketizerPath)
> ---> 17 loadedBucketizer = Bucketizer.load(bucketizerPath)
> 18 loadedBucketizer.getSplits() == bucketizer.getSplits()
>
> /databricks/spark/python/pyspark/ml/util.py in load(cls, path)
> 376 def load(cls, path):
> 377 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
> --> 378 return cls.read().load(path)
> 379
> 380
>
> /databricks/spark/python/pyspark/ml/util.py in load(self, path)
> 330 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
> 331 % self._clazz)
> --> 332 return self._clazz._from_java(java_obj)
> 333
> 334
>
> /databricks/spark/python/pyspark/ml/wrapper.py in _from_java(java_stage)
> 258
> 259 py_stage._resetUid(java_stage.uid())
> --> 260 py_stage._transfer_params_from_java()
> 261 elif hasattr(py_type, "_from_java"):
> 262 py_stage = py_type._from_java(java_stage)
>
> /databricks/spark/python/pyspark/ml/wrapper.py in _transfer_params_from_java(self)
> 186 # SPARK-14931: Only check set params back to avoid default params mismatch.
> 187         if self._java_obj.isSet(java_param):
> --> 188             value = _java2py(sc, self._java_obj.getOrDefault(java_param))
> 189 self._set(**{param.name: value})
> 190 # SPARK-10931: Temporary fix for params that have a default in Java
>
> /databricks/spark/python/pyspark/ml/common.py in _java2py(sc, r, encoding)
> 107
> 108 if isinstance(r, (bytearray, bytes)):
> --> 109 r = PickleSerializer().loads(bytes(r), encoding=encoding)
> 110 return r
> 111
>
> /databricks/spark/python/pyspark/serializers.py in loads(self, obj, encoding)
> 467
> 468 def loads(self, obj, encoding="bytes"):
> --> 469 return pickle.loads(obj, encoding=encoding)
> 470
> 471
>
> TypeError: array() argument 1 must be a unicode character, not bytes
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org