You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/20 16:21:05 UTC

[GitHub] [spark] tedyu opened a new pull request, #37952: Treat unknown partitioning as UnknownPartitioning

tedyu opened a new pull request, #37952:
URL: https://github.com/apache/spark/pull/37952

   ### What changes were proposed in this pull request?
   When running spark application against spark 3.3, I see the following :
   ```
   java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning
       at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46)
       at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34)
       at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
   ```
   The `CustomPartitioning` works fine with Spark 3.2.1
   This PR proposes to relax the code and treat all unknown partitioning the same way as that for `UnknownPartitioning`.
   
   ### Why are the changes needed?
   3.3.0 doesn't seem to warrant such behavioral change (from that of 3.2.1 release).
   
   ### Does this PR introduce _any_ user-facing change?
   This would allow user's custom partitioning to continue to work with 3.3.x releases.
   
   ### How was this patch tested?
   Existing test suite.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1253967479

   Thanks! merged to master. @tedyu could you create a PR for branch-3.3 as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252722669

   I have run the test using Cassandra Spark connector and modified Spark (with this patch).
   
   The test passes (without modification to Cassandra Spark connector or client code).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252620344

   Can you directly report `UnknownPartitioning` to Spark?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: [SPARK-40508] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252871488

   @sunchao 
   Please take another look.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao closed pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning
URL: https://github.com/apache/spark/pull/37952


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1253980266

   Yea good point. A unit test would be nice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252594388

   cc @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252627778

   If custom partitioning reports `UnknownPartitioning` to Spark and can keep 3.2.1 behavior, that means the current check is not desired.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252841711

   I guess this PR make sense. @tedyu could you:
   - create a Spark JIRA for this issue? and update the PR title to reflect it?
   - add a warning message too? clients may expect Spark to use the partitioning they reported and could be surprised that Spark internally ignores it, so a warning message would be helpful for them to debug.
   
   I think the best solution is for connectors such as Cassandra to adopt the new API, otherwise they could see severe performance penalties. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a diff in pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37952:
URL: https://github.com/apache/spark/pull/37952#discussion_r975791836


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala:
##########
@@ -54,8 +55,9 @@ object V2ScanPartitioningAndOrdering extends Rule[LogicalPlan] with SQLConfHelpe
             }
           }
         case _: UnknownPartitioning => None
-        case p => throw new IllegalArgumentException("Unsupported data source V2 partitioning " +
-            "type: " + p.getClass.getSimpleName)
+        case p =>
+          logWarning("Spark ignores the partitioning. Please use KeyGroupedPartitioning for better performance")

Review Comment:
   nit: could we also log the class name here? e.g.,
   ```
   Spark ignores partitioning ${p.getClass.getSimpleName}. Please use KeyGroupedPartitioning for better performance
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1253011133

   @sunchao 
   https://github.com/tedyu/spark/runs/8459534296 shows that all tests have passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: [SPARK-40508][SQL] Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1253904835

   @sunchao 
   Can this PR be merged ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tedyu commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
tedyu commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252616217

   If I subclass `UnknownPartitioning` directly, I would get this compilation error:
   ```
   [error] /nfusr/dev-server/zyu/spark-cassandra-connector/connector/src/main/scala/com/datastax/spark/connector/datasource/CassandraScanBuilder.scala:327:92: not enough arguments for constructor UnknownPartitioning: (x$1: Int)org.apache.spark.sql.connector.read.partitioning.UnknownPartitioning.
   [error] Unspecified value parameter x$1.
   [error] case class CassandraPartitioning(partitionKeys: Array[String], numPartitions: Int) extends UnknownPartitioning {
   [error]                                                                                            ^
   [error] one error found
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #37952: Treat unknown partitioning as UnknownPartitioning

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37952:
URL: https://github.com/apache/spark/pull/37952#issuecomment-1252603472

   @tedyu could you share some background information? if `CustomPartitioning` is handled the same way as `UnknownPartitioning`, why can't you just use the latter instead?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org