You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/04/24 22:09:29 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request, #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

aokolnychyi opened a new pull request, #7424:
URL: https://github.com/apache/iceberg/pull/7424

   This PR adds tests for SPJ when partition keys mismatch, which was not supported prior to 3.4. Now Iceberg supports SPJ in MERGE without limitations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "nastra (via GitHub)" <gi...@apache.org>.
nastra commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1176103017


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestStoragePartitionedJoinsInRowLevelOperations.java:
##########
@@ -257,14 +261,22 @@ private void checkMerge(RowLevelOperationMode mode) {
           SparkPlan plan =
               executeAndKeepPlan(
                   "MERGE INTO %s AS t USING %s AS s "
-                      + "ON t.id = s.id AND t.dep = s.dep AND t.dep = 'hr'"
+                      + "ON t.id = s.id AND t.dep = s.dep "
                       + "WHEN MATCHED THEN "
                       + "  UPDATE SET t.salary = s.salary "
                       + "WHEN NOT MATCHED THEN "
                       + "  INSERT *",
                   tableName, tableName(OTHER_TABLE_NAME));
           String planAsString = plan.toString();
-          Assert.assertFalse("Should be no shuffles with SPJ", planAsString.contains("Exchange"));
+          if (mode == COPY_ON_WRITE) {
+            int actualNumShuffles = StringUtils.countMatches(planAsString, "Exchange");
+            Assert.assertEquals("Should be 1 shuffle with SPJ", 1, actualNumShuffles);
+            Assert.assertTrue(

Review Comment:
   nit: using `Assertions.assertThat(planAsString).contains(...)` / `Assertions.assertThat(planAsString).doesNotContain(...)` will actually give you the entire `planAsString` in case the assertion ever fails (which makes debugging much easier)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1175825387


##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/sql/TestStoragePartitionedJoins.java:
##########
@@ -94,7 +96,6 @@ public void removeTables() {
   }
 
   // TODO: add tests for truncate transforms once SPARK-40295 is released

Review Comment:
   Tests for truncate are coming in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1176787655


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestStoragePartitionedJoinsInRowLevelOperations.java:
##########
@@ -257,14 +261,22 @@ private void checkMerge(RowLevelOperationMode mode) {
           SparkPlan plan =
               executeAndKeepPlan(
                   "MERGE INTO %s AS t USING %s AS s "
-                      + "ON t.id = s.id AND t.dep = s.dep AND t.dep = 'hr'"
+                      + "ON t.id = s.id AND t.dep = s.dep "
                       + "WHEN MATCHED THEN "
                       + "  UPDATE SET t.salary = s.salary "
                       + "WHEN NOT MATCHED THEN "
                       + "  INSERT *",
                   tableName, tableName(OTHER_TABLE_NAME));
           String planAsString = plan.toString();
-          Assert.assertFalse("Should be no shuffles with SPJ", planAsString.contains("Exchange"));
+          if (mode == COPY_ON_WRITE) {
+            int actualNumShuffles = StringUtils.countMatches(planAsString, "Exchange");
+            Assert.assertEquals("Should be 1 shuffle with SPJ", 1, actualNumShuffles);
+            Assert.assertTrue(

Review Comment:
   Thanks, @nastra! Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1176801848


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestStoragePartitionedJoinsInRowLevelOperations.java:
##########
@@ -257,14 +261,22 @@ private void checkMerge(RowLevelOperationMode mode) {
           SparkPlan plan =
               executeAndKeepPlan(
                   "MERGE INTO %s AS t USING %s AS s "
-                      + "ON t.id = s.id AND t.dep = s.dep AND t.dep = 'hr'"

Review Comment:
   Should we leave in a test that does this join though? I know it shouldn't be required but we might as well keep a test that also checks if SPJ kicks in with the additional predicate as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1176806457


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestStoragePartitionedJoinsInRowLevelOperations.java:
##########
@@ -257,14 +261,22 @@ private void checkMerge(RowLevelOperationMode mode) {
           SparkPlan plan =
               executeAndKeepPlan(
                   "MERGE INTO %s AS t USING %s AS s "
-                      + "ON t.id = s.id AND t.dep = s.dep AND t.dep = 'hr'"

Review Comment:
   I don't think a predicate would affect SPJ but I can keep this test separately to be safe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1175825193


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestStoragePartitionedJoinsInRowLevelOperations.java:
##########
@@ -257,14 +261,22 @@ private void checkMerge(RowLevelOperationMode mode) {
           SparkPlan plan =
               executeAndKeepPlan(
                   "MERGE INTO %s AS t USING %s AS s "
-                      + "ON t.id = s.id AND t.dep = s.dep AND t.dep = 'hr'"

Review Comment:
   This condition was a hard requirement in 3.3, otherwise SPJ would not kick in. It is no longer needed, so I adapted the test to verify mismatching keys are supported. That's the usual MERGE use case since it may contain data for new partitions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] anigos commented on pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "anigos (via GitHub)" <gi...@apache.org>.
anigos commented on PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#issuecomment-1576167143

   @aokolnychyi this is amazing and a longtime ask. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi merged PR #7424:
URL: https://github.com/apache/iceberg/pull/7424


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#discussion_r1176833372


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestStoragePartitionedJoinsInRowLevelOperations.java:
##########
@@ -257,14 +261,22 @@ private void checkMerge(RowLevelOperationMode mode) {
           SparkPlan plan =
               executeAndKeepPlan(
                   "MERGE INTO %s AS t USING %s AS s "
-                      + "ON t.id = s.id AND t.dep = s.dep AND t.dep = 'hr'"

Review Comment:
   Added both tests now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #7424: Spark 3.4: Add tests for SPJ when partition keys mismatch

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7424:
URL: https://github.com/apache/iceberg/pull/7424#issuecomment-1522282540

   Thanks for reviewing, @nastra @RussellSpitzer!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org