You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/31 07:36:16 UTC

[GitHub] [iceberg] HeartSaVioR opened a new pull request #1280: Core: fix serialization issue in BaseCombinedScanTask with Kyro

HeartSaVioR opened a new pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280


   This patch fixes #1279 
   
   Kyro is not aware of Guava collection classes and trying to deal with the interface level which has no idea whether the implementation is immutable. I see we already fixed some of them, but during experimenting I found the another case, `BasedCombinedScanTask`.
   
   This patch changes BasedCombinedScanTask to retain the instance of Java's ArrayList, and adds relevant unit tests.
   
   There're some refactoring as well to deduplicate the code:
   
   * I found the implementation of `copyList` are scattered, hence created a new util class and moved it.
   * Comparing BaseCombinedScanTask needs compare of DataFile, which we already have the method for this, but in other test suite. I created a new helper class to move out comparison methods so that they can be co-used.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#discussion_r467505473



##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.junit.Assert;
+
+public final class SerializationCheckHelper {
+  private SerializationCheckHelper() {
+  }
+
+  public static void checkBaseCombinedScanTask(BaseCombinedScanTask expected, BaseCombinedScanTask actual) {

Review comment:
       You might consider renaming these methods to `assertEquals` if that's what they are doing. That makes them more generic and easier for read because it tells the reader exactly what it validated. Otherwise, someone that wants to reuse these methods needs to look at what the check does.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on a change in pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#discussion_r467079223



##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.junit.Assert;
+
+public final class SerializationCheckHelper {
+  private SerializationCheckHelper() {
+  }
+
+  public static void checkBaseCombinedScanTask(BaseCombinedScanTask expected, BaseCombinedScanTask actual) {
+    List<FileScanTask> expectedTasks = getFileScanTasksInFilePathOrder(expected);
+    List<FileScanTask> actualTasks = getFileScanTasksInFilePathOrder(actual);
+
+    Assert.assertEquals("The number of file scan tasks should match",
+        expectedTasks.size(), actualTasks.size());
+
+    for (int i = 0; i < expectedTasks.size(); i++) {
+      FileScanTask expectedTask = expectedTasks.get(i);
+      FileScanTask actualTask = actualTasks.get(i);
+      checkFileScanTask(expectedTask, actualTask);
+    }
+  }
+
+  private static List<FileScanTask> getFileScanTasksInFilePathOrder(BaseCombinedScanTask task) {
+    return task.files().stream()
+        // use file path + start position to differentiate the tasks
+        .sorted(Comparator.comparing(o -> o.file().path().toString() + "##" + o.start()))
+        .collect(Collectors.toList());
+  }
+
+  public static void checkFileScanTask(FileScanTask expected, FileScanTask actual) {

Review comment:
       This is moved from `TestScanTaskSerialization.checkFileScanTask`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on a change in pull request #1280: Core: fix serialization issue in BaseCombinedScanTask with Kyro

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#discussion_r463454025



##########
File path: spark/src/test/java/org/apache/iceberg/actions/TestRewriteDataFilesAction.java
##########
@@ -311,6 +324,64 @@ public void testRewriteLargeTableHasResiduals() {
     Assert.assertEquals("Rows must match", records, actualRecords);
   }
 
+  @Test

Review comment:
       I decided to add the serde tests of BaseCombinedScanTask here, because it seems to require non-trivial efforts to create BasedCombinedScanTask by hand. I can move out if we'd like to have separate suite with manually crafted test objects.

##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.junit.Assert;
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public final class SerializationCheckHelper {
+  private SerializationCheckHelper() {
+  }
+
+  public static void checkBaseCombinedScanTask(BaseCombinedScanTask expected, BaseCombinedScanTask actual) {
+    List<FileScanTask> expectedTasks = getFileScanTasksInFilePathOrder(expected);
+    List<FileScanTask> actualTasks = getFileScanTasksInFilePathOrder(actual);
+
+    Assert.assertEquals("The number of file scan tasks should match",
+        expectedTasks.size(), actualTasks.size());
+
+    for (int i = 0 ; i < expectedTasks.size(); i++) {
+      FileScanTask expectedTask = expectedTasks.get(i);
+      FileScanTask actualTask = actualTasks.get(i);
+      checkFileScanTask(expectedTask, actualTask);
+    }
+  }
+
+  private static List<FileScanTask> getFileScanTasksInFilePathOrder(BaseCombinedScanTask task) {
+    return task.files().stream()
+        // use file path + start position to differentiate the tasks
+        .sorted(Comparator.comparing(o -> o.file().path().toString() + "##" + o.start()))
+        .collect(Collectors.toList());
+  }
+
+  public static void checkFileScanTask(FileScanTask expected, FileScanTask actual) {
+    checkDataFile(expected.file(), actual.file());
+
+    // PartitionSpec implements its own equals method
+    Assert.assertEquals("PartitionSpec doesn't match", expected.spec(), actual.spec());
+
+    Assert.assertEquals("starting position doesn't match", expected.start(), actual.start());
+
+    Assert.assertEquals("the number of bytes to scan doesn't match", expected.start(), actual.start());
+
+    // simplify comparison on residual expression via comparing toString
+    Assert.assertEquals("Residual expression doesn't match",
+        expected.residual().toString(), actual.residual().toString());
+  }
+
+  public static void checkDataFile(DataFile expected, DataFile actual) {

Review comment:
       This is identical to `TestDataFileSerialization.checkDataFile` - just moved to here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#discussion_r467505136



##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.junit.Assert;
+
+public final class SerializationCheckHelper {

Review comment:
       It looks like these are mostly equivalent to `assertEquals`. In that case, they don't necessarily need to be used to validate serialization. To make them easier to find and reuse, how about naming this class `TaskCheckHelper`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on a change in pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#discussion_r467079516



##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.junit.Assert;
+
+public final class SerializationCheckHelper {
+  private SerializationCheckHelper() {
+  }
+
+  public static void checkBaseCombinedScanTask(BaseCombinedScanTask expected, BaseCombinedScanTask actual) {

Review comment:
       This is moved from `TestScanTaskSerialization.checkBaseCombinedScanTask`.

##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.junit.Assert;
+
+public final class SerializationCheckHelper {
+  private SerializationCheckHelper() {
+  }
+
+  public static void checkBaseCombinedScanTask(BaseCombinedScanTask expected, BaseCombinedScanTask actual) {
+    List<FileScanTask> expectedTasks = getFileScanTasksInFilePathOrder(expected);
+    List<FileScanTask> actualTasks = getFileScanTasksInFilePathOrder(actual);
+
+    Assert.assertEquals("The number of file scan tasks should match",
+        expectedTasks.size(), actualTasks.size());
+
+    for (int i = 0; i < expectedTasks.size(); i++) {
+      FileScanTask expectedTask = expectedTasks.get(i);
+      FileScanTask actualTask = actualTasks.get(i);
+      checkFileScanTask(expectedTask, actualTask);
+    }
+  }
+
+  private static List<FileScanTask> getFileScanTasksInFilePathOrder(BaseCombinedScanTask task) {

Review comment:
       This is moved from `TestScanTaskSerialization.getFileScanTasksInFilePathOrder`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#issuecomment-671849788


   Thanks for reviewing and merging!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR edited a comment on pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#issuecomment-670302109


   This is rebased after merging #1285 to handle only refactoring.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#issuecomment-671624646


   Thanks for updating this, @HeartSaVioR! Looks great. I merged it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#issuecomment-670302109


   This is rebased after merging #1285.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on a change in pull request #1280: Spark: refactor serialization test suites

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #1280:
URL: https://github.com/apache/iceberg/pull/1280#discussion_r467079223



##########
File path: spark/src/test/java/org/apache/iceberg/SerializationCheckHelper.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.junit.Assert;
+
+public final class SerializationCheckHelper {
+  private SerializationCheckHelper() {
+  }
+
+  public static void checkBaseCombinedScanTask(BaseCombinedScanTask expected, BaseCombinedScanTask actual) {
+    List<FileScanTask> expectedTasks = getFileScanTasksInFilePathOrder(expected);
+    List<FileScanTask> actualTasks = getFileScanTasksInFilePathOrder(actual);
+
+    Assert.assertEquals("The number of file scan tasks should match",
+        expectedTasks.size(), actualTasks.size());
+
+    for (int i = 0; i < expectedTasks.size(); i++) {
+      FileScanTask expectedTask = expectedTasks.get(i);
+      FileScanTask actualTask = actualTasks.get(i);
+      checkFileScanTask(expectedTask, actualTask);
+    }
+  }
+
+  private static List<FileScanTask> getFileScanTasksInFilePathOrder(BaseCombinedScanTask task) {
+    return task.files().stream()
+        // use file path + start position to differentiate the tasks
+        .sorted(Comparator.comparing(o -> o.file().path().toString() + "##" + o.start()))
+        .collect(Collectors.toList());
+  }
+
+  public static void checkFileScanTask(FileScanTask expected, FileScanTask actual) {

Review comment:
       This is moved from `TestScanTaskSerialization.java. checkFileScanTask`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org