You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/22 12:00:26 UTC

[GitHub] [spark] mhconradt opened a new pull request, #37616: [SPARK-40178][SQL][TESTS]Fix partitioning hint parameters in PySpark

mhconradt opened a new pull request, #37616:
URL: https://github.com/apache/spark/pull/37616

   I added code that converts the column parameters to Java expressions before passing them to the JVM hint method.
   Partitioning hint parameters used to raise an error:
   ```
   >>> df = spark.range(1024)
   >>> 
   >>> df
   DataFrame[id: bigint]
   >>> df.hint("rebalance", "id")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, in hint
       jdf = self._jdf.hint(name, self._jseq(parameters))
     File "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include columns, but id found
   >>> df.hint("repartition", "id")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, in hint
       jdf = self._jdf.hint(name, self._jseq(parameters))
     File "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should include columns, but id found
   ```
   This is a bug because there's no other way to specify a column as a hint parameter in PySpark.
   
   After this MR this functionality works:
   ```
   >>> df = spark.range(1024)
   >>> df.hint("repartition", 'id')
   DataFrame[id: bigint]
   >>> df.hint('repartition', 'id').explain()
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Exchange hashpartitioning(id#0L, 200), REPARTITION_BY_COL, [plan_id=6]
      +- Range (0, 1024, step=1, splits=8)
   
   
   >>> df.hint('rebalance', 'id').explain()
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Exchange hashpartitioning(id#0L, 200), REBALANCE_PARTITIONS_BY_COL, [plan_id=14]
      +- Range (0, 1024, step=1, splits=8)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   It fixes a bug.
   
   ### How was this patch tested?
   I added a test case `test_partitioning_hints` on `DataFrameTests` to test the partitioning hint functionality in its entirety, not only ensuring it did not raise a spurious exception, but also that the repartitioning does occur.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] advancedxy commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by "advancedxy (via GitHub)" <gi...@apache.org>.

advancedxy commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1654917282

   @mhconradt are you still working on this? If not, I would like to pick this up.
   
   > Came across this wanting to test out the rebalance hint in pyspark (since it looks like rebalance can only be used as a hint right now). Does it make more sense to support strings directly in ResolveHints? It is pretty awkward that SQL hints get interpreted as expressions, but DataFrame hints don't. It's definitely awkward having to use $"col".expr even on the Scala side. And in ResolveHints it already supports the number of partitions either being an Literal or an integer
   
   yeah. I think the hint method in the Dataset side should support string/integer parameters directly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org