Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/09/12 21:39:50 UTC

[GitHub] [iceberg] wypoon commented on pull request #5742: Spark: Add read conf for setting threshold to use streaming delete filter

wypoon commented on PR #5742:
URL: https://github.com/apache/iceberg/pull/5742#issuecomment-1244528824

   @flyrain @RussellSpitzer this is a follow-up to https://github.com/apache/iceberg/pull/4588. That change left one code path untested: counting positional deletes when a streaming delete filter is used. I tested that path manually by temporarily lowering the threshold for using a streaming filter in `DeleteFilter` from 100,000 to 2 and running `TestSparkReaderDeletes`. This change makes the threshold configurable so that tests can set it. I had included this change in the original PR at one point, but Russell asked me to separate it out because that PR was already quite complex.
   
   The logic behind this change is as follows:
   We add a `streamDeleteFilterThreshold` field to `SparkScan.ReadTask`. The `planInputPartitions` methods of both `SparkBatch` and `SparkMicroBatchStream` construct `SparkScan.ReadTask`s, and both classes take a `SparkReadConf`, so they can read the threshold from the `SparkReadConf` and pass it in when constructing each `SparkScan.ReadTask`. `SparkScan.RowReader` and `SparkScan.BatchReader` each take a `SparkScan.ReadTask` in their constructors, so they can get the threshold from the task and pass it up the constructor chain to their respective superclasses, `RowDataReader` and `BatchDataReader`. Those classes construct a `BaseReader.SparkDeleteFilter` in their `open(FileScanTask)` methods, which is where the threshold value is finally passed in.
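
The plumbing described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Iceberg code: every class here (`ReadConf`, `ReadTask`, `TaskReader`, `DataReader`, `DeleteFilter`) is a simplified, hypothetical stand-in for the real class named in its comment, and the comparison in `usesStreamingFilter` is an assumption about how the threshold is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThresholdPlumbingSketch {

  // Stand-in for SparkReadConf: the single place where the threshold is resolved.
  static class ReadConf {
    private final long streamDeleteFilterThreshold;

    ReadConf(long streamDeleteFilterThreshold) {
      this.streamDeleteFilterThreshold = streamDeleteFilterThreshold;
    }

    long streamDeleteFilterThreshold() {
      return streamDeleteFilterThreshold;
    }
  }

  // Stand-in for SparkScan.ReadTask: carries the threshold alongside the task's file.
  static class ReadTask {
    private final String file;
    private final long streamDeleteFilterThreshold;

    ReadTask(String file, long streamDeleteFilterThreshold) {
      this.file = file;
      this.streamDeleteFilterThreshold = streamDeleteFilterThreshold;
    }

    String file() {
      return file;
    }

    long streamDeleteFilterThreshold() {
      return streamDeleteFilterThreshold;
    }
  }

  // Stand-in for planInputPartitions in SparkBatch and SparkMicroBatchStream.
  static List<ReadTask> planInputPartitions(ReadConf conf, List<String> files) {
    List<ReadTask> tasks = new ArrayList<>();
    for (String file : files) {
      tasks.add(new ReadTask(file, conf.streamDeleteFilterThreshold()));
    }
    return tasks;
  }

  // Stand-in for BaseReader.SparkDeleteFilter.
  static class DeleteFilter {
    private final long threshold;

    DeleteFilter(long threshold) {
      this.threshold = threshold;
    }

    // Assumption: the streaming filter is chosen once the delete count reaches the threshold.
    boolean usesStreamingFilter(long positionalDeleteCount) {
      return positionalDeleteCount >= threshold;
    }
  }

  // Stand-in for RowDataReader and BatchDataReader: open() builds the delete filter.
  static class DataReader {
    private final long threshold;

    DataReader(long threshold) {
      this.threshold = threshold;
    }

    DeleteFilter open(String file) {
      return new DeleteFilter(threshold);
    }
  }

  // Stand-in for SparkScan.RowReader and SparkScan.BatchReader.
  static class TaskReader extends DataReader {
    TaskReader(ReadTask task) {
      super(task.streamDeleteFilterThreshold()); // pass the threshold up the constructor chain
    }
  }

  public static void main(String[] args) {
    ReadConf conf = new ReadConf(2L); // a low threshold, as one might set in a test
    for (ReadTask task : planInputPartitions(conf, Arrays.asList("a.parquet", "b.parquet"))) {
      DeleteFilter filter = new TaskReader(task).open(task.file());
      System.out.println(task.file() + " streaming: " + filter.usesStreamingFilter(3L));
    }
  }
}
```

The point of the sketch is that the value is resolved once from the read conf and then only copied along the task and reader constructors, so a test that sets a low threshold in the conf reaches the streaming path without touching `DeleteFilter` itself.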


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org