You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/09 22:04:00 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request, #6550: Spark 3.3: Automatically set Arrow properties for read performance

aokolnychyi opened a new pull request, #6550:
URL: https://github.com/apache/iceberg/pull/6550

   This PR adds logic to automatically set Arrow properties for read performance. Unless these properties are set, our read path can be up to 2x slower than built-in read path in Spark.
   
   I verified this patch by removing all explicit settings and running benchmarks with and without setting properties.
   
   Benchmark results without setting Arrow properties:
   ```
   Benchmark                                                              Mode  Cnt  Score   Error  Units
   VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  1.321 ± 0.029   s/op
   VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.064 ± 0.162   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  2.187 ± 0.031   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.304 ± 0.287   s/op
   ```
   
   Benchmark results with setting Arrow properties automatically:
   ```
   Benchmark                                                              Mode  Cnt  Score   Error  Units
   VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  0.927 ± 0.028   s/op
   VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.035 ± 0.070   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  1.306 ± 0.029   s/op
   VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.369 ± 0.114   s/op
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#issuecomment-1377828306

   I took a look at `ArrowColumnVector` in Spark and it seems to be using its own nullability mechanism.
   
   ```
   @Override
   public UTF8String getUTF8String(int rowId) {
     if (isNullAt(rowId)) return null;
     return accessor.getUTF8String(rowId);
   }
   ```
   
   Given the performance impact, I am inclined to go ahead with this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
flyrain commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065247118


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(
+      Schema expectedSchema,
+      MessageType fileSchema,
+      Map<Integer, ?> idToConstant,
+      DeleteFilter<InternalRow> deleteFilter) {
+    return (ColumnarBatchReader)
+        TypeWithSchemaVisitor.visit(
+            expectedSchema.asStruct(),
+            fileSchema,
+            new ReaderBuilder(
+                expectedSchema,
+                fileSchema,
+                NullCheckingForGet.NULL_CHECKING_ENABLED,

Review Comment:
   yeah, it is going to change a lot of places. I'm OK with another PR or something.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065229065


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(
+      Schema expectedSchema,
+      MessageType fileSchema,
+      Map<Integer, ?> idToConstant,
+      DeleteFilter<InternalRow> deleteFilter) {
+    return (ColumnarBatchReader)
+        TypeWithSchemaVisitor.visit(
+            expectedSchema.asStruct(),
+            fileSchema,
+            new ReaderBuilder(
+                expectedSchema,
+                fileSchema,
+                NullCheckingForGet.NULL_CHECKING_ENABLED,

Review Comment:
   I did that initially too but I had to change more places as it is used in tests. So I gave up and reverted. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#issuecomment-1376491203

   > My only worry is we basically are changing behavior for other Arrow uses within the Spark Environment and we do not change the property back to the default when we are done if it was not configured.
   
   I had the same question but the problem is that Arrow would read properties only once and then cache the value. Even if we change the values back, it will have no effect. Let me look into Spark a bit more to see if anything can go wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#issuecomment-1377925747

   Thanks for reviewing, @singhpk234 @samarthjain @flyrain @RussellSpitzer!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065154236


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(

Review Comment:
   We never had specific Iceberg configs for this behavior. We always relied on system properties for Arrow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged PR #6550:
URL: https://github.com/apache/iceberg/pull/6550


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065253115


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(
+      Schema expectedSchema,
+      MessageType fileSchema,
+      Map<Integer, ?> idToConstant,
+      DeleteFilter<InternalRow> deleteFilter) {
+    return (ColumnarBatchReader)
+        TypeWithSchemaVisitor.visit(
+            expectedSchema.asStruct(),
+            fileSchema,
+            new ReaderBuilder(
+                expectedSchema,
+                fileSchema,
+                NullCheckingForGet.NULL_CHECKING_ENABLED,

Review Comment:
   Agreed. I'll double check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1066294492


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(
+      Schema expectedSchema,
+      MessageType fileSchema,
+      Map<Integer, ?> idToConstant,
+      DeleteFilter<InternalRow> deleteFilter) {
+    return (ColumnarBatchReader)
+        TypeWithSchemaVisitor.visit(
+            expectedSchema.asStruct(),
+            fileSchema,
+            new ReaderBuilder(
+                expectedSchema,
+                fileSchema,
+                NullCheckingForGet.NULL_CHECKING_ENABLED,

Review Comment:
   I took another look. We should be able to get rid of this property once the deprecated methods are removed. We can then switch to setting the Arrow property in tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065164062


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(

Review Comment:
   That's nice to hear, @samarthjain! All of our tests also set these properties so it seems fairly safe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] samarthjain commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
samarthjain commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065156577


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(

Review Comment:
   LGTM. 
   
   In our internal Spark fork, we have the arrow config settings configured in the executor JVM args. I can confirm that things have been running fine with them in our prod env for the last 2+ years. So, +1. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
flyrain commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065225701


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(
+      Schema expectedSchema,
+      MessageType fileSchema,
+      Map<Integer, ?> idToConstant,
+      DeleteFilter<InternalRow> deleteFilter) {
+    return (ColumnarBatchReader)
+        TypeWithSchemaVisitor.visit(
+            expectedSchema.asStruct(),
+            fileSchema,
+            new ReaderBuilder(
+                expectedSchema,
+                fileSchema,
+                NullCheckingForGet.NULL_CHECKING_ENABLED,

Review Comment:
   Minor point: Since we are going to read/write this system property everywhere, I guess there is no point to pass around this parameter. Always good to minimize the parameters. I'm OK to fix it or not though. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#issuecomment-1376516447

   +1 on making these configs default 
   more tickets related to same : 
   1. https://github.com/apache/iceberg/issues/4217#issuecomment-1159209016
   2. https://github.com/apache/iceberg/issues/6320


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#discussion_r1065150851


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:
##########
@@ -75,6 +108,57 @@ public static ColumnarBatchReader buildReader(
                 deleteFilter));
   }
 
+  public static ColumnarBatchReader buildReader(

Review Comment:
   This method is added to make sure `NullCheckingForGet.NULL_CHECKING_ENABLED` is referenced after we set system properties in this class. Otherwise, Arrow would memorize them earlier and our logic would have no effect. See `BoundsChecking` and `NullCheckingForGet` in Arrow for details.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #6550: Spark 3.3: Automatically set Arrow properties for read performance

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #6550:
URL: https://github.com/apache/iceberg/pull/6550#issuecomment-1376395958

   cc @samarthjain @rdblue @flyrain @RussellSpitzer @szehon-ho 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org