Posted to issues@iceberg.apache.org by "amogh-jahagirdar (via GitHub)" <gi...@apache.org> on 2023/05/12 00:59:04 UTC

[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

amogh-jahagirdar opened a new pull request, #7591:
URL: https://github.com/apache/iceberg/pull/7591

   Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release 
   cc: @szehon-ho @aokolnychyi @nastra 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#discussion_r1193227616


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java:
##########
@@ -57,6 +58,10 @@ public class TestParquetVectorizedReads extends AvroDataTest {
 
   static final Function<GenericData.Record, GenericData.Record> IDENTITY = record -> record;
 
+  static {
+    System.setProperty("arrow.enable_null_check_for_get", "true");
+  }

Review Comment:
   I think I'm overcomplicating this. Since Iceberg performs the nullability check anyway, there's not much value in our test path even validating the Arrow validity buffer (which was the premise of https://github.com/apache/iceberg/pull/6550/files). I think we can just remove the assertions related to `checkArrowValidityVector` from these tests.





[GitHub] [iceberg] aokolnychyi merged pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi merged PR #7591:
URL: https://github.com/apache/iceberg/pull/7591




[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#discussion_r1192679845


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java:
##########
@@ -57,6 +58,10 @@ public class TestParquetVectorizedReads extends AvroDataTest {
 
   static final Function<GenericData.Record, GenericData.Record> IDENTITY = record -> record;
 
+  static {
+    System.setProperty("arrow.enable_null_check_for_get", "true");
+  }

Review Comment:
   This is probably not the right way to handle the failing tests. For context, [here](https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java#L99) we always use the `NullCheckingForGet.NULL_CHECKING_ENABLED` value, a static final field that is set once, based on the value of the `arrow.enable_null_check_for_get` property.
   
   Right now most of our tests want to validate the validity buffer, and a few do not. It's not possible to set the property dynamically for these different cases: because the field is static final, once it's set, every read of the value just yields the original value.
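
To illustrate why the property cannot be flipped per test, here is a minimal, self-contained sketch. `NullCheckFlag` is a hypothetical stand-in for a class that captures the property in a static final field; it is not Arrow's actual `NullCheckingForGet` implementation, and the `"false"` fallback is an assumption of the sketch only.

```java
// Sketch only: NullCheckFlag is a hypothetical stand-in, not Arrow's class.
final class StaticFinalPropertyDemo {

  // The property is read exactly once, when this nested class is initialized.
  static final class NullCheckFlag {
    static final boolean ENABLED =
        Boolean.parseBoolean(System.getProperty("arrow.enable_null_check_for_get", "false"));

    private NullCheckFlag() {}
  }

  private StaticFinalPropertyDemo() {}

  // Reads the flag before and after changing the system property.
  static boolean[] observe() {
    System.setProperty("arrow.enable_null_check_for_get", "true");
    boolean first = NullCheckFlag.ENABLED; // first access triggers class initialization

    // Changing the property now has no effect on the already-initialized field.
    System.setProperty("arrow.enable_null_check_for_get", "false");
    boolean second = NullCheckFlag.ENABLED;

    return new boolean[] {first, second};
  }

  public static void main(String[] args) {
    boolean[] seen = observe();
    System.out.println("first read: " + seen[0] + ", second read: " + seen[1]);
  }
}
```

Both reads return the value captured at class initialization, so two test classes running in the same JVM cannot each choose their own setting this way.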
   
   Previously we had an API for explicitly telling the vectorized reader whether it should use the validity buffer, but that API is deprecated and we now want to remove it.
   
   In practice, users will set this once for their Spark job, but for the purpose of testing we want to validate both paths. My implementation here just optimizes for the majority of the existing test cases, but misses out on validating the behavior when this is set to false, which is the default due to better performance.
   
   Long story short, I'm thinking we should still expose a method for setting the validity buffer, but make it package-private. This package-private method would be used only for testing, to construct a Parquet reader depending on what we want to test.
   
   Thoughts @aokolnychyi @singhpk234 ?
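
A rough sketch of what such a package-private hook could look like. Every name here (`ValidityAwareReaders`, `Reader`, `setsValidityBuffer`) is hypothetical and chosen for illustration; this is not the actual `VectorizedSparkParquetReaders` code, and the reader is reduced to a single flag.

```java
// Sketch only: all class and method names are hypothetical.
final class ValidityAwareReaders {

  // Captured once at class initialization, like a flag backed by a system property.
  private static final boolean NULL_CHECKING_ENABLED =
      Boolean.parseBoolean(System.getProperty("arrow.enable_null_check_for_get", "false"));

  private ValidityAwareReaders() {}

  // Minimal stand-in for a vectorized reader; it only records which path was chosen.
  static final class Reader {
    private final boolean setsValidityBuffer;

    Reader(boolean setsValidityBuffer) {
      this.setsValidityBuffer = setsValidityBuffer;
    }

    boolean setsValidityBuffer() {
      return setsValidityBuffer;
    }
  }

  // Public entry point: always follows the JVM-wide flag.
  public static Reader buildReader() {
    return buildReader(NULL_CHECKING_ENABLED);
  }

  // Package-private overload: tests in the same package pick the path explicitly.
  static Reader buildReader(boolean setValidityBuffer) {
    return new Reader(setValidityBuffer);
  }

  public static void main(String[] args) {
    System.out.println(buildReader(true).setsValidityBuffer());
    System.out.println(buildReader(false).setsValidityBuffer());
  }
}
```

With this shape, production callers keep a single public method, while tests in the same package can cover both the validity-buffer path and the default path without touching the system property.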







[GitHub] [iceberg] amogh-jahagirdar commented on pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#issuecomment-1544956414

   Ah, looks like some tests still reference the deprecated methods. Will fix those.




[GitHub] [iceberg] amogh-jahagirdar commented on pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#issuecomment-1545821517

   I need to update the tests so that the Arrow null-checking property is enabled.




[GitHub] [iceberg] aokolnychyi commented on pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#issuecomment-1548774430

   Thanks, @amogh-jahagirdar! Thanks for reviewing, @szehon-ho!



