You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/07 21:45:19 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #4132: Support parquet page filtering for string columns

alamb opened a new pull request, #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132

   Draft as it builds on tests in https://github.com/apache/arrow-datafusion/pull/4131
   
   # Which issue does this PR close?
   
   Part of https://github.com/apache/arrow-datafusion/issues/3833
   
   # Rationale for this change
   
   I want to be able to use parquet page index filtering for string datatypes in IOx.
   
   # What changes are included in this PR?
   
   1. Implement page index filtering for string columns
   2. Test
   
   # Are there any user-facing changes?
   
   Hopefully faster parquet predicate evaluation on string columns


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#discussion_r1016029491


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -421,7 +423,16 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;
+                let array: StringArray = vec

Review Comment:
   should also support the type `BYTE_ARRAY` in the `null_counts` of `PagesPruningStatistics`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1314964105

   @alamb  Would your mind i continue on this, with pr on your branch?😄


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1315449601

   @Ted-Jiang  -- go right ahead (or also feel free to make another PR and I can close this one0  -- I haven't had the time to work on this for the last few days. I will likely get back to it either later this week or next if you don't have a chance to


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1306298307

   @alamb does it make sense to also include support for `null_counts()`?
   https://github.com/apache/arrow-datafusion/blob/7b5842b91ebd00a2c7f894fcad797bea68a56d0f/datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs#L440-L464


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1317983032

   > There is so much going on in DataFusion I can barely keep up!
   
   of course! You've given too much to this community ❤️


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#discussion_r1016027321


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -421,7 +423,16 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;
+                let array: StringArray = vec

Review Comment:
   we need check the logical type for the value.
   BYTE_ARRAY in the parquet can represent many logical types, such as `DECIMAL`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1308656884

   > 
   
   
   > Thanks for the comments @liukun4515 and @isidentical -- I agree about the nullcounts. I will work on this PR more to address your comments
   
   maybe you can just implement the case about utf8 type, I can help you to implement other logical data type for example Decimal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1307114561

   Thanks for the comments @liukun4515  and @isidentical  -- I agree about the nullcounts. I will work on this PR more to address your comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#discussion_r1015930022


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -421,7 +423,16 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;
+                let array: StringArray = vec

Review Comment:
   I am not 100% sure if this is ok (like what if the parquet data got mapped to a LargeStringArray? 🤔 



##########
datafusion/core/tests/parquet_filter_pushdown.rs:
##########
@@ -266,20 +266,17 @@ async fn single_file_small_data_pages() {
     // page 3:                                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: djzdyiecnumrsrcbizwlqzdhnpoiqdh, max: fktdcgtmzvoedpwhfevcvvrtaurzgex, num_nulls not defined] CRC:[none] SZ:7 VC:9216
     // page 4:                                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: fktdcgtmzvoedpwhfevcvvrtaurzgex, max: fwtdpgtxwqkkgtgvthhwycrvjiizdifyp, num_nulls not defined] CRC:[none] SZ:7 VC:9216
     // page 5:                                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: fwtdpgtxwqkkgtgvthhwycrvjiizdifyp, max: iadnalqpdzthpifrvewossmpqibgtsuin, num_nulls not defined] CRC:[none] SZ:7 VC:7739
-    //
-    // This test currently fails due to https://github.com/apache/arrow-datafusion/issues/3833
-    // (page index pruning not implemented for byte array)
-
-    // TestCase::new(&test_parquet_file)
-    //     .with_name("selective")
-    //     // predicagte is chosen carefully to prune pages 0, 1, 2, 3, 4
-    //     // pod = 'iadnalqpdzthpifrvewossmpqibgtsuin'
-    //     .with_filter(col("pod").eq(lit("iadnalqpdzthpifrvewossmpqibgtsuin")))
-    //     .with_pushdown_expected(PushdownExpected::Some)
-    //     .with_page_index_filtering_expected(PageIndexFilteringExpected::Some)
-    //     .with_expected_rows(2574)
-    //     .run()
-    //     .await;
+
+    TestCase::new(&test_parquet_file)
+        .with_name("selective")
+        // predicate is chosen carefully to prune pages 0, 1, 2, 3, 4
+        // pod = 'iadnalqpdzthpifrvewossmpqibgtsuin'
+        .with_filter(col("pod").eq(lit("iadnalqpdzthpifrvewossmpqibgtsuin")))
+        .with_pushdown_expected(PushdownExpected::Some)
+        .with_page_index_filtering_expected(PageIndexFilteringExpected::Some)
+        .with_expected_rows(2574)
+        .run()
+        .await;

Review Comment:
   this now passes!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4132:
URL: https://github.com/apache/arrow-datafusion/pull/4132#issuecomment-1317757883

   @Ted-Jiang  I am not sure what happened to this branch -- can you please make a new branch /new PR for this feature? I didn't have a chance to work on (or review) your PR today because some other unexpected work appeared.  😢 
   
   
   There is so much going on in DataFusion I can barely keep up!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed pull request #4132: Support parquet page filtering for string columns

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #4132: Support parquet page filtering for string columns
URL: https://github.com/apache/arrow-datafusion/pull/4132


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org