Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/08/22 12:02:00 UTC
[jira] [Commented] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-40178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582932#comment-17582932 ] 

Apache Spark commented on SPARK-40178:
--------------------------------------

User 'mhconradt' has created a pull request for this issue:
https://github.com/apache/spark/pull/37616

> Rebalance/Repartition Hints Not Working in PySpark
> --------------------------------------------------
>
>                 Key: SPARK-40178
>                 URL: https://issues.apache.org/jira/browse/SPARK-40178
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>         Environment: macOS 11.4 Big Sur
> Python 3.9.7
> Spark version >= 3.2.0 (possibly earlier versions as well).
>            Reporter: Maxwell Conradt
>            Priority: Major
>             Fix For: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Partitioning hints that take column parameters (for example, `rebalance` and `repartition`) fail in PySpark with an `AnalysisException`, because the column parameters are not converted to Catalyst `Expression` instances before being passed to the hint resolver.
> The behavior of the hints is documented [here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].
> Example:
>  
> {code:python}
> >>> df = spark.range(1024)
> >>> 
> >>> df
> DataFrame[id: bigint]
> >>> df.hint("rebalance", "id")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include columns, but id found
> >>> df.hint("repartition", "id")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should include columns, but id found
> {code}
>  
>  
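The failure described in the issue can be illustrated with a small, self-contained model in plain Python. This is only a sketch under stated assumptions: `Expression`, `AnalysisException`, `resolve_partition_hint`, and `to_expressions` are hypothetical stand-ins written for this illustration, not Spark or PySpark APIs, and the check is a simplification of what the real hint resolver does. It shows why a bare string such as "id" is rejected while a value that has been converted to an expression is accepted. It does not claim to reproduce how the linked pull request fixes the problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """Hypothetical stand-in for a Catalyst expression that refers to a column."""
    name: str


class AnalysisException(Exception):
    """Hypothetical stand-in for pyspark.sql.utils.AnalysisException."""


def resolve_partition_hint(hint_name, parameters):
    """Simplified resolver: accept only Expression parameters.

    Assumption: this mirrors the check implied by the error message in the
    issue; the real resolver has more cases than this.
    """
    for param in parameters:
        if not isinstance(param, Expression):
            raise AnalysisException(
                f"{hint_name.upper()} Hint parameter should include columns, "
                f"but {param} found"
            )
    return (hint_name.upper(), tuple(parameters))


def to_expressions(parameters):
    """The conversion step the issue says is missing (modelled, not Spark's code)."""
    return [p if isinstance(p, Expression) else Expression(p) for p in parameters]


# Passing the raw string reproduces the shape of the reported error.
try:
    resolve_partition_hint("rebalance", ["id"])
except AnalysisException as exc:
    print(exc)  # REBALANCE Hint parameter should include columns, but id found

# Converting first lets the same hint resolve.
print(resolve_partition_hint("rebalance", to_expressions(["id"])))
```

In this model the only difference between the failing and the succeeding call is whether the conversion runs before the resolver sees the parameters, which is the gap the issue description points at.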



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
