Posted to issues@iceberg.apache.org by "amogh-jahagirdar (via GitHub)" <gi...@apache.org> on 2023/05/12 18:27:45 UTC

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

amogh-jahagirdar commented on code in PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#discussion_r1192679845


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java:
##########
@@ -57,6 +58,10 @@ public class TestParquetVectorizedReads extends AvroDataTest {
 
   static final Function<GenericData.Record, GenericData.Record> IDENTITY = record -> record;
 
+  static {
+    System.setProperty("arrow.enable_null_check_for_get", "true");
+  }

Review Comment:
   TBH this probably is not the right way to handle the failing tests. For context, [here](https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java#L99) we always use the `NullCheckingForGet.NULL_CHECKING_ENABLED` value, a `static final` field that is set once from the value of the `arrow.enable_null_check_for_get` property.
   
   Right now most of our tests want to validate the validity buffer, and a few do not. It's not possible to set the property dynamically for these different cases because the field is `static final`: once it's set, every read of the value just yields the original value.
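   
   To illustrate the point about the `static final` flag, here is a minimal sketch. It does not use Arrow itself: the `Flag` class and the `demo.enable_null_check_for_get` property name are hypothetical stand-ins, chosen so the example runs on its own.
   
   ```java
   public class StaticFlagDemo {
     // Hypothetical stand-in for a class like NullCheckingForGet:
     // the property is read exactly once, when the class is initialized.
     static final class Flag {
       static final boolean NULL_CHECKING_ENABLED =
           Boolean.parseBoolean(System.getProperty("demo.enable_null_check_for_get", "false"));
     }
   
     // Returns the flag as seen before and after the property is changed.
     static boolean[] run() {
       System.setProperty("demo.enable_null_check_for_get", "true");
       boolean first = Flag.NULL_CHECKING_ENABLED; // first access initializes Flag
   
       System.setProperty("demo.enable_null_check_for_get", "false");
       boolean second = Flag.NULL_CHECKING_ENABLED; // still the value captured above
   
       return new boolean[] {first, second};
     }
   
     public static void main(String[] args) {
       boolean[] reads = run();
       System.out.println(reads[0] + " " + reads[1]);
     }
   }
   ```
   
   Both reads return the value captured at initialization, so a test that flips the property after the class has loaded does not change which path the reader takes.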
   
   Before, we had an API for explicitly telling the vectorized reader whether to use the validity buffer, but now we want to deprecate that.
   
   In practice users will set this once for their Spark job, but for the purpose of testing we want to validate both paths. My implementation here just optimizes for the majority of the existing test cases, but misses out on validating the behavior when this is set to false, which is the default due to better performance.
   
   Long story short, I'm thinking we should still expose a method for setting the validity buffer, but make it package-private. It would be used only for testing, to construct a Parquet reader depending on what we want to test.
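   
   One possible shape for this, as a rough sketch only: every name below (`ReaderFactorySketch`, `Reader`, `DEFAULT_NULL_CHECKING`, the `demo.enable_null_check_for_get` property) is hypothetical and is not the actual Iceberg or Arrow API. The public entry point follows the process-wide flag, while a package-private overload lets tests in the same package pick either path.
   
   ```java
   public class ReaderFactorySketch {
     // Hypothetical stand-in for the process-wide flag, captured once at class initialization.
     static final boolean DEFAULT_NULL_CHECKING =
         Boolean.parseBoolean(System.getProperty("demo.enable_null_check_for_get", "false"));
   
     // Hypothetical reader that only records which path it was built for.
     static final class Reader {
       private final boolean setValidityVector;
   
       Reader(boolean setValidityVector) {
         this.setValidityVector = setValidityVector;
       }
   
       boolean setsValidityVector() {
         return setValidityVector;
       }
     }
   
     // Public entry point: always follows the process-wide flag.
     public static Reader buildReader() {
       return buildReader(DEFAULT_NULL_CHECKING);
     }
   
     // Package-private overload: only code in the same package (for example tests)
     // can choose the validity-buffer behavior explicitly.
     static Reader buildReader(boolean setValidityVector) {
       return new Reader(setValidityVector);
     }
   }
   ```
   
   A test in the same package could then call `buildReader(true)` and `buildReader(false)` to cover both paths without touching system properties.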
   
   Thoughts @aokolnychyi @singhpk234 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org