Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/27 22:48:07 UTC

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3992: Spark 3.2: Fix cardinality check for alternative join implementations

RussellSpitzer commented on a change in pull request #3992:
URL: https://github.com/apache/iceberg/pull/3992#discussion_r794065099



##########
File path: spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/MergeRowsExec.scala
##########
@@ -46,6 +48,15 @@ case class MergeRowsExec(
     output: Seq[Attribute],
     child: SparkPlan) extends UnaryExecNode {
 
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (performCardinalityCheck) {
+      // request a local sort by the row ID attrs to co-locate matches for the same target row
+      Seq(rowIdAttrs.map(attr => SortOrder(attr, Ascending)))

Review comment:
       Feels like this is all leaning heavily on implementation details to actually do our check. While I think this fix is fine for now, maybe we should consider doing the cardinality check with an actual group-by, or with a custom aggregation that throws an error on duplicate records?
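To illustrate the alternative being suggested, here is a minimal, Spark-free sketch of a group-by style cardinality check: group candidate matches by their target row ID and fail fast when any target row matched more than once. All names (`CardinalityCheck`, `checkOnce`, `RowIdViolation`) are hypothetical and for illustration only; a real version would be expressed as a Spark aggregation rather than an in-memory groupBy.

```scala
// Hypothetical sketch, not the actual Iceberg/Spark implementation:
// detect cardinality violations by grouping on the target row ID
// instead of relying on a sort order to co-locate matches.
final case class RowIdViolation(rowId: Any)
    extends RuntimeException(
      s"Cardinality violation: target row $rowId matched more than one source row")

object CardinalityCheck {
  // rows: pairs of (target row ID, matched source row); names are illustrative.
  // Returns the input unchanged when every target row matched at most once.
  def checkOnce[K, V](rows: Seq[(K, V)]): Seq[(K, V)] = {
    rows.groupBy { case (id, _) => id }.foreach { case (id, matches) =>
      if (matches.size > 1) throw RowIdViolation(id)
    }
    rows
  }
}
```

In Spark terms this would correspond to a `groupBy(rowIdAttrs)` with a count-based error aggregation, trading the local-sort requirement for a shuffle (or a hash aggregate), which is what makes it robust to alternative join implementations.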




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


