Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/22 12:00:26 UTC

[GitHub] [spark] mhconradt opened a new pull request, #37616: [SPARK-40178][SQL][TESTS] Fix partitioning hint parameters in PySpark

mhconradt opened a new pull request, #37616:
URL: https://github.com/apache/spark/pull/37616

   I added code that converts column parameters to Java expressions before passing them to the JVM `hint` method.
   Previously, passing a column name as a partitioning hint parameter raised an error:
   ```
   >>> df = spark.range(1024)
   >>> 
   >>> df
   DataFrame[id: bigint]
   >>> df.hint("rebalance", "id")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, in hint
       jdf = self._jdf.hint(name, self._jseq(parameters))
     File "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include columns, but id found
   >>> df.hint("repartition", "id")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, in hint
       jdf = self._jdf.hint(name, self._jseq(parameters))
     File "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should include columns, but id found
   ```
   This is a bug because there's no other way to specify a column as a hint parameter in PySpark.
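   Roughly, the change converts string hint parameters to JVM expressions before the py4j call. A minimal sketch of the idea, assuming the helper name `_to_java_expr` used in the review diff below; the actual implementation may differ in details:
   ```
   from pyspark.sql.column import _to_java_column


   def _to_java_expr(col):
       # A Column or a column-name string is first turned into a JVM Column and
       # then unwrapped to its Catalyst Expression (Column.expr on the JVM side).
       return _to_java_column(col).expr()
   ```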
   
   After this PR, this functionality works:
   ```
   >>> df = spark.range(1024)
   >>> df.hint("repartition", 'id')
   DataFrame[id: bigint]
   >>> df.hint('repartition', 'id').explain()
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Exchange hashpartitioning(id#0L, 200), REPARTITION_BY_COL, [plan_id=6]
      +- Range (0, 1024, step=1, splits=8)
   
   
   >>> df.hint('rebalance', 'id').explain()
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- Exchange hashpartitioning(id#0L, 200), REBALANCE_PARTITIONS_BY_COL, [plan_id=14]
      +- Range (0, 1024, step=1, splits=8)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it fixes a bug: partitioning hints with column-name parameters no longer raise an `AnalysisException` in PySpark.
   
   ### How was this patch tested?
   I added a test case, `test_partitioning_hints`, to `DataFrameTests` that exercises the partitioning hint functionality end to end, verifying not only that no spurious exception is raised but also that the repartitioning actually occurs.
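   A rough sketch of what such a test looks like (illustrative only; the hints covered and the plan check here are simplified, not the exact test added in the PR):
   ```
   def test_partitioning_hints(self):
       df = self.spark.range(1024)
       for hint in ["repartition", "repartition_by_range", "rebalance"]:
           hinted = df.hint(hint, "id")
           # The hint must resolve without raising AnalysisException and must
           # not change the data itself.
           self.assertEqual(hinted.count(), 1024)
           # The optimized plan should show a repartition/rebalance on the column.
           plan = hinted._jdf.queryExecution().optimizedPlan().toString()
           self.assertIn("id", plan)
   ```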




[GitHub] [spark] advancedxy commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by "advancedxy (via GitHub)" <gi...@apache.org>.
advancedxy commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1654917282

   @mhconradt are you still working on this? If not, I would like to pick this up.
   
   > Came across this wanting to test out the rebalance hint in pyspark (since it looks like rebalance can only be used as a hint right now). Does it make more sense to support strings directly in ResolveHints? It is pretty awkward that SQL hints get interpreted as expressions, but DataFrame hints don't. It's definitely awkward having to use $"col".expr even on the Scala side. And ResolveHints already supports the number of partitions being either a Literal or an integer.
   
   Yeah, I think the `hint` method on the Dataset side should support string/integer parameters directly.




[GitHub] [spark] Kimahriman commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by "Kimahriman (via GitHub)" <gi...@apache.org>.
Kimahriman commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1399596763

   Came across this wanting to test out the `rebalance` hint in pyspark (since it looks like rebalance can only be used as a hint right now). Does it make more sense to support strings directly in `ResolveHints`? It is pretty awkward that SQL hints get interpreted as expressions, but DataFrame hints don't. It's definitely awkward having to use `$"col".expr` even on the Scala side. And `ResolveHints` already supports the number of partitions being either a `Literal` or an integer.
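   For comparison, the SQL hint path already resolves bare column names, while the DataFrame `hint` method forwards its parameters to the JVM as-is, which is where the mismatch comes from. A hedged illustration (the view name `t` is made up here):
   ```
   spark.range(1024).createOrReplaceTempView("t")
   # SQL hint: the parser turns `id` into an expression, so this works today.
   spark.sql("SELECT /*+ REPARTITION(id) */ * FROM t")
   # DataFrame hint: the string "id" is forwarded to the JVM untouched, which is
   # what raised the AnalysisException shown earlier in this thread.
   spark.table("t").hint("repartition", "id")
   ```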




[GitHub] [spark] mhconradt commented on a diff in pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
mhconradt commented on code in PR #37616:
URL: https://github.com/apache/spark/pull/37616#discussion_r952498749


##########
python/pyspark/sql/column.py:
##########
@@ -71,6 +71,18 @@ def _to_java_column(col: "ColumnOrName") -> JavaObject:
     return jcol
 
 
+def _to_java_expr(col: "ColumnOrName") -> JavaObject:
+    if isinstance(col, (Column, str)):

Review Comment:
   As it stands, the only allowed type for column hint parameters is `str`: https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L971
   It probably makes sense to allow `Column` objects as well, since both are supported by the `repartition` and `repartitionByRange` methods.
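   In other words, both of these would be accepted, mirroring `repartition`/`repartitionByRange` (illustrative only; whether `Column` parameters end up supported depends on the final shape of this PR):
   ```
   from pyspark.sql import functions as F

   df = spark.range(1024)
   df.hint("repartition", "id")           # column name as a string
   df.hint("repartition", F.col("id"))    # Column object, like df.repartition(F.col("id"))
   ```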





[GitHub] [spark] mhconradt commented on a diff in pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
mhconradt commented on code in PR #37616:
URL: https://github.com/apache/spark/pull/37616#discussion_r952494642


##########
python/pyspark/sql/column.py:
##########
@@ -71,6 +71,18 @@ def _to_java_column(col: "ColumnOrName") -> JavaObject:
     return jcol
 
 
+def _to_java_expr(col: "ColumnOrName") -> JavaObject:
+    if isinstance(col, (Column, str)):

Review Comment:
   On the Scala side, you need to pass an `Expression`:
   ```
   scala> val df = spark.range(1024)
   df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> df.hint("rebalance", $"id".expr)
   res0: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   ```





[GitHub] [spark] github-actions[bot] closed pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark
URL: https://github.com/apache/spark/pull/37616




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37616:
URL: https://github.com/apache/spark/pull/37616#discussion_r952036075


##########
python/pyspark/sql/column.py:
##########
@@ -71,6 +71,18 @@ def _to_java_column(col: "ColumnOrName") -> JavaObject:
     return jcol
 
 
+def _to_java_expr(col: "ColumnOrName") -> JavaObject:
+    if isinstance(col, (Column, str)):

Review Comment:
   I am saying that there's a duplicate if-else here, so the else branch is unreachable.





[GitHub] [spark] HyukjinKwon commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1222359068

   Let's enable GitHub Actions (GA) in your fork (see https://github.com/apache/spark/runs/7951168069).




[GitHub] [spark] felipepessoto commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
felipepessoto commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1312224727

   For Scala, is it expected that you need to call `.expr`, or do we need to fix that as well?




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37616: [SPARK-40178][SQL][TESTS] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37616:
URL: https://github.com/apache/spark/pull/37616#discussion_r951436056


##########
python/pyspark/sql/dataframe.py:
##########
@@ -977,7 +977,8 @@ def hint(
                     )
                 )
 
-        jdf = self._jdf.hint(name, self._jseq(parameters))
+        jdf = self._jdf.hint(name, self._jseq(parameters,
+                                              converter=lambda x: _to_java_expr(x) if isinstance(x, (Column, str)) else x))

Review Comment:
   There's a duplicate if-else in `_to_java_expr`.





[GitHub] [spark] AmplabJenkins commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1223035399

   Can one of the admins verify this patch?




[GitHub] [spark] mhconradt commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
mhconradt commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1223355455

   Without that additional if-else, this code would raise a TypeError. The rationale is that `_to_java_column` only supports `str` and `Column`, so we only use it to convert parameters of those types and don't apply additional conversions to other types.
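   A condensed sketch of the two checks being discussed (the raise branch in the helper is illustrative; only the isinstance guard is visible in the diff above):
   ```
   from pyspark.sql.column import Column, _to_java_column


   def _to_java_expr(col):
       if isinstance(col, (Column, str)):     # guard inside the helper
           return _to_java_column(col).expr()
       raise TypeError("col should be a Column or str, got %s" % type(col))


   # Call site: the same guard filters values first, so non-column parameters
   # (e.g. ints) pass through untouched and the helper's raise is never hit here.
   converter = lambda x: _to_java_expr(x) if isinstance(x, (Column, str)) else x
   ```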
   




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37616: [SPARK-40178][SQL][TESTS] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37616:
URL: https://github.com/apache/spark/pull/37616#discussion_r951434652


##########
python/pyspark/sql/column.py:
##########
@@ -71,6 +71,18 @@ def _to_java_column(col: "ColumnOrName") -> JavaObject:
     return jcol
 
 
+def _to_java_expr(col: "ColumnOrName") -> JavaObject:
+    if isinstance(col, (Column, str)):

Review Comment:
   Wouldn't this disable `str` as a value? We should probably allow only `Column`. How does it work on the Scala side?



##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -14,7 +14,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-

Review Comment:
   Let's remove all unrelated changes.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37616:
URL: https://github.com/apache/spark/pull/37616#discussion_r1022305164


##########
python/pyspark/sql/dataframe.py:
##########
@@ -968,7 +968,7 @@ def hint(
         if not isinstance(name, str):
             raise TypeError("name should be provided as str, got {0}".format(type(name)))
 
-        allowed_types = (str, list, float, int)
+        allowed_types = (str, Column, int)

Review Comment:
   Maybe we can add `Column` to this allowed list and leave the rest as it is, because other custom types would break otherwise. BTW, does Scala support such a hint too, e.g. `df.hint("REPARTITION", "a")`? This method calls the Scala side directly.
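   Roughly, the suggestion is to extend rather than replace the tuple in the diff above (illustrative):
   ```
   allowed_types = (str, list, float, int, Column)
   ```
   That keeps hints that pass lists or floats working while still admitting `Column` parameters.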





[GitHub] [spark] github-actions[bot] commented on pull request #37616: [SPARK-40178][PYTHON][SQL] Fix partitioning hint parameters in PySpark

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #37616:
URL: https://github.com/apache/spark/pull/37616#issuecomment-1532305768

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

