You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "chenhao-db (via GitHub)" <gi...@apache.org> on 2024/03/26 19:14:26 UTC

[PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

chenhao-db opened a new pull request, #45730:
URL: https://github.com/apache/spark/pull/45730

   ### What changes were proposed in this pull request?
   
   In the `Window` node, both `partitionSpec` and `orderSpec` must be orderable, but the current type check only verifies `orderSpec` is orderable. This can cause an error in later optimizing phases.
   
   Given a query:
   
   ```
   with t as (select id, map(id, id) as m from range(0, 10))
   select rank() over (partition by m order by id) from t
   ```
   
   Before the PR, it fails with an `INTERNAL_ERROR`:
   
   ```
   org.apache.spark.SparkException: [INTERNAL_ERROR] grouping/join/window partition keys cannot be map type. SQLSTATE: XX000
   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
   at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.needNormalize(NormalizeFloatingNumbers.scala:103)
   at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.org$apache$spark$sql$catalyst$optimizer$NormalizeFloatingNumbers$$needNormalize(NormalizeFloatingNumbers.scala:94)
   ...
   ```
   
   After the PR, it fails with a `DATATYPE_MISMATCH.INVALID_ORDERING_TYPE`, which is expected:
   
   ```
     org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "m" due to data type mismatch: The `attributereference` does not support ordering on type "MAP<BIGINT, BIGINT>". SQLSTATE: 42K09; line 2 pos 53;
   Project [RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
   +- Project [id#1L, m#0, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
      +- Window [rank(id#1L) windowspecdefinition(m#0, id#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4], [m#0], [id#1L ASC NULLS FIRST]
         +- Project [id#1L, m#0]
            +- SubqueryAlias t
               +- SubqueryAlias t
                  +- Project [id#1L, map(id#1L, id#1L) AS m#0]
                     +- Range (0, 10, step=1, splits=None)
     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
   ...
   ```
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "chenhao-db (via GitHub)" <gi...@apache.org>.
chenhao-db commented on code in PR #45730:
URL: https://github.com/apache/spark/pull/45730#discussion_r1544103456


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -537,6 +537,18 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
               }
             }
 
+          case Window(_, partitionSpec, _, _) =>
+            // Both `partitionSpec` and `orderSpec` must be orderable. We only need an extra check
+            // for `partitionSpec` here because `orderSpec` has the type check itself.
+            partitionSpec.foreach { p =>
+              if (!RowOrdering.isOrderable(p.dataType)) {
+                p.dataTypeMismatch(p, TypeCheckResult.DataTypeMismatch(
+                  errorSubClass = "INVALID_ORDERING_TYPE",

Review Comment:
   I cannot really tell the difference between these two error classes, so I just randomly picked one. Now I changed it into `EXPRESSION_TYPE_IS_NOT_ORDERABLE` instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "chenhao-db (via GitHub)" <gi...@apache.org>.
chenhao-db commented on PR #45730:
URL: https://github.com/apache/spark/pull/45730#issuecomment-2026676267

   Things get a bit more complex after #45549.
   
   > For example, map type is not orderable, but can be used as grouping keys.
   
   It can only be used as grouping keys after #45549. However, with the map normalization and comparison code introduced in this PR, the map type can also be orderable.
   
   > what is the behavior today? 
   
   Before #45549, the SQL snippet in my PR description would fail with an `INTERNAL_ERROR`. After #45549, it would be successfully executed. When a column is used as a window partition key, it is translated into a local `Sort` over a `HashPartitioning`. Although the current type check code of `SortOrder` rejects map type, it doesn't run when the `SortOrder` is constructed from a window partition key. Then, the `Sort` node will successfully compare map inputs, because #45549 supports map comparison.
   
   There are still some issues with sorting maps, because the map normalization code is only inserted into aggregate, but not sort. But I think they are fixable, and the map type is indeed orderable. However, if a type is not orderable, it cannot be a window partition key, so the code change in this PR is still necessary.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45730:
URL: https://github.com/apache/spark/pull/45730#issuecomment-2026853194

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45730:
URL: https://github.com/apache/spark/pull/45730#issuecomment-2026687978

   @chenhao-db thanks for the analysis! Let's be strict first and we can relax map type later if needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45730:
URL: https://github.com/apache/spark/pull/45730#issuecomment-2026572282

   what is the behavior today? Do we allow to use map type as window partition key?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45730:
URL: https://github.com/apache/spark/pull/45730#discussion_r1544088131


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -537,6 +537,18 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
               }
             }
 
+          case Window(_, partitionSpec, _, _) =>
+            // Both `partitionSpec` and `orderSpec` must be orderable. We only need an extra check
+            // for `partitionSpec` here because `orderSpec` has the type check itself.
+            partitionSpec.foreach { p =>
+              if (!RowOrdering.isOrderable(p.dataType)) {
+                p.dataTypeMismatch(p, TypeCheckResult.DataTypeMismatch(
+                  errorSubClass = "INVALID_ORDERING_TYPE",

Review Comment:
   why not use the error class `EXPRESSION_TYPE_IS_NOT_ORDERABLE`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45730:
URL: https://github.com/apache/spark/pull/45730#issuecomment-2026556724

   I think ordering and equality are two abilities. For example, map type is not orderable, but can be used as grouping keys.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47572][SQL] Enforce Window partitionSpec is orderable. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45730: [SPARK-47572][SQL] Enforce Window partitionSpec is orderable.
URL: https://github.com/apache/spark/pull/45730


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org