You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/14 09:33:26 UTC

[GitHub] [iceberg] samarthjain opened a new pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

samarthjain opened a new pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samarthjain commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
samarthjain commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r558765297



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,23 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
       case INTEGER:
+        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
+                ((Number) actual).intValue());

Review comment:
       Yeah. The records in this test are created with data types int and float. However, when we read it, we use the Iceberg schema which has these columns as long and double. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samarthjain commented on pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
samarthjain commented on pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#issuecomment-761130617


   > @samarthjain, can you describe the problem and how you fixed it in this PR?
   
   Done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r559764879



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,22 +267,24 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case INTEGER:
-        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
-                ((Number) actual).intValue());
-        break;
       case LONG:
-        Assert.assertEquals("Values didn't match", ((Number) expected).longValue(),
-                ((Number) actual).longValue());
-        break;
-      case FLOAT:
-        Assert.assertEquals("Values didn't match", Float.floatToIntBits(((Number) expected).floatValue()),
-                Float.floatToIntBits(((Number) actual).floatValue()));
+        if (expected instanceof Integer) {
+          Assert.assertEquals("Values didn't match", ((Number) expected).longValue(),
+              ((Number) actual).longValue());

Review comment:
       This should not alter the type of the actual value. If actual is an `int`, that's still an error.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samarthjain commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
samarthjain commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r558698866



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,23 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
       case INTEGER:
+        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
+                ((Number) actual).intValue());

Review comment:
       Without this change, the comparison fails with ClassCastException.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samarthjain commented on pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
samarthjain commented on pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#issuecomment-761268837


   > @samarthjain, so if I understand correctly, the vector will always reflect what's in the file and the type promotion is done when reading from the vector?
   
   Yes. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r559022078



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,23 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
       case INTEGER:
+        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
+                ((Number) actual).intValue());

Review comment:
       Okay, so the problem is that the expected records used to match have not been promoted?
   
   If that's the case, then I think we should handle type promotion for the expected values in the assertions. Let's pass the expected schema, which will have `long` or `double`. Then the behavior for `int` and `float` don't need to change because there is no promotion to handle. In the cases for `long` and `double`, the expected value can be checked using `expected instanceof Integer` and promoted if needed. Otherwise, the original `assertEquals` can be used.
   
   I think that will more narrowly validate the cases here. What we want to avoid is allowing `1` and `1.0D` to match if the type is `int` because there is no promotion to `int` and a `double` should never be promoted to `long` if the type were `long`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r558577297



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,23 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
       case INTEGER:
+        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
+                ((Number) actual).intValue());
+        break;
       case LONG:
+        Assert.assertEquals("Values didn't match", ((Number) expected).longValue(),
+                ((Number) actual).longValue());
+        break;
       case FLOAT:
+        Assert.assertEquals("Values didn't match", ((Number) expected).floatValue(),
+                ((Number) actual).floatValue(), 0.01F);

Review comment:
       We generally don't use `assertEquals` that has a floating point tolerance because Iceberg does not calculate values. The values should match exactly. I think this should use `Float.floatToIntBits` and compare the integer results.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r559827572



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -268,17 +268,18 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
 
     switch (type.typeId()) {
       case LONG:
+        Assert.assertTrue("Should be a long", actual instanceof Long);
         if (expected instanceof Integer) {
-          Assert.assertEquals("Values didn't match", ((Number) expected).longValue(),
-              ((Number) actual).longValue());
+          Assert.assertEquals("Values didn't match", ((Number) expected).longValue(), actual);
         } else {
           Assert.assertEquals("Primitive value should be equal to expected", expected, actual);
         }
         break;
       case DOUBLE:
+        Assert.assertTrue("Should be a double", actual instanceof Double);
         if (expected instanceof Float) {
           Assert.assertEquals("Values didn't match", Float.floatToIntBits(((Number) expected).floatValue()),
-              Float.floatToIntBits(((Number) actual).floatValue()));
+                    Float.floatToIntBits(((Number) actual).floatValue()));

Review comment:
       Why isn't this assertion like the one for `long`? Shouldn't this use the actual value and convert the expect to a `double`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#issuecomment-761077860


   @samarthjain, can you describe the problem and how you fixed it in this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r558577923



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,23 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
       case INTEGER:
+        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
+                ((Number) actual).intValue());

Review comment:
       Why was it necessary to change this method? Shouldn't these values be the same type already?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r559840439



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,26 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
-      case INTEGER:
       case LONG:
-      case FLOAT:
+        Assert.assertTrue("Should be a long", actual instanceof Long);
+        if (expected instanceof Integer) {
+          Assert.assertEquals("Values didn't match", ((Number) expected).longValue(), actual);
+        } else {
+          Assert.assertEquals("Primitive value should be equal to expected", expected, actual);
+        }
+        break;
       case DOUBLE:
+        Assert.assertTrue("Should be a double", actual instanceof Double);
+        if (expected instanceof Float) {
+          Assert.assertEquals("Values didn't match", Double.doubleToLongBits(((Double) expected).doubleValue()),

Review comment:
       Expected won't be a `Double`. Should you cast to `Float` or `Number` instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r558577494



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java
##########
@@ -231,4 +232,33 @@ public void testVectorizedReadsWithReallocatedArrowBuffers() throws IOException
           return record;
         });
   }
+
+  @Test
+  public void testReadsForTypePromotedColumns() throws Exception {
+    Schema writeSchema = new Schema(
+            required(100, "id", Types.LongType.get()),
+            optional(101, "int_data", Types.IntegerType.get()),
+            optional(102, "float_data", Types.FloatType.get()),
+            optional(103, "decimal_data", Types.DecimalType.of(10, 5))

Review comment:
       Nit: indentation in this function it off. It should be 2 indents (4 spaces) for continuation indents.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#issuecomment-761180368


   @samarthjain, so if I understand correctly, the vector will always reflect what's in the file and the type promotion is done when reading from the vector?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samarthjain commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
samarthjain commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r559833653



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -268,17 +268,18 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
 
     switch (type.typeId()) {
       case LONG:
+        Assert.assertTrue("Should be a long", actual instanceof Long);
         if (expected instanceof Integer) {
-          Assert.assertEquals("Values didn't match", ((Number) expected).longValue(),
-              ((Number) actual).longValue());
+          Assert.assertEquals("Values didn't match", ((Number) expected).longValue(), actual);
         } else {
           Assert.assertEquals("Primitive value should be equal to expected", expected, actual);
         }
         break;
       case DOUBLE:
+        Assert.assertTrue("Should be a double", actual instanceof Double);
         if (expected instanceof Float) {
           Assert.assertEquals("Values didn't match", Float.floatToIntBits(((Number) expected).floatValue()),
-              Float.floatToIntBits(((Number) actual).floatValue()));
+                    Float.floatToIntBits(((Number) actual).floatValue()));

Review comment:
       Ugh..I just can't seem to get this right :-(. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#issuecomment-761291687


   FYI @jackye1995. We should probably try to get this into the 0.11.0 release.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2091: Spark - Fix parquet vectorized reads when column types are promoted

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2091:
URL: https://github.com/apache/iceberg/pull/2091#discussion_r558753802



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
##########
@@ -267,11 +267,23 @@ private static void assertEqualsUnsafe(Type type, Object expected, Object actual
     }
 
     switch (type.typeId()) {
-      case BOOLEAN:
       case INTEGER:
+        Assert.assertEquals("Values didn't match", ((Number) expected).intValue(),
+                ((Number) actual).intValue());

Review comment:
       Sounds like you're comparing objects that are two different types. Can you find out why the types don't match?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org