Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/02/13 21:04:08 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #5268: Support page skipping / page_index pushdown for evolved schemas

alamb opened a new pull request, #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/5104
   
   # Rationale for this change
   
   See https://github.com/apache/arrow-datafusion/issues/5104
   
   I want to turn on page index pushdown for all queries. I can't do that while it still causes errors for queries across parquet files whose schemas differ.
   
   # What changes are included in this PR?
   
   Fix a bug (I will describe inline)
   
   # Are these changes tested?
   
   Yes
   
   IN PROGRESS -- testing against https://github.com/apache/arrow-datafusion/pull/5099
   
   # Are there any user-facing changes?
   Fewer bugs.




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105847268


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -550,50 +550,63 @@ pub(crate) mod test_util {
     use parquet::file::properties::WriterProperties;
     use tempfile::NamedTempFile;
 
+    /// How many rows per page should be written
+    const ROWS_PER_PAGE: usize = 2;
+
     /// Writes `batches` to a temporary parquet file
     ///
-    /// If multi_page is set to `true`, all batches are written into
-    /// one temporary parquet file and the parquet file is written
+    /// If multi_page is set to `true`, the parquet file(s) are written
     /// with 2 rows per data page (used to test page filtering and
     /// boundaries).
     pub async fn store_parquet(
         batches: Vec<RecordBatch>,
         multi_page: bool,
     ) -> Result<(Vec<ObjectMeta>, Vec<NamedTempFile>)> {
-        if multi_page {
-            // All batches write in to one file, each batch must have same schema.
-            let mut output = NamedTempFile::new().expect("creating temp file");
-            let mut builder = WriterProperties::builder();
-            builder = builder.set_data_page_row_count_limit(2);
-            let proper = builder.build();
-            let mut writer =
-                ArrowWriter::try_new(&mut output, batches[0].schema(), Some(proper))
-                    .expect("creating writer");
-            for b in batches {
-                writer.write(&b).expect("Writing batch");
-            }
-            writer.close().unwrap();
-            Ok((vec![local_unpartitioned_file(&output)], vec![output]))
-        } else {
-            // Each batch writes to their own file
-            let files: Vec<_> = batches
-                .into_iter()
-                .map(|batch| {
-                    let mut output = NamedTempFile::new().expect("creating temp file");
+        // Each batch writes to their own file
+        let files: Vec<_> = batches
+            .into_iter()
+            .map(|batch| {
+                let mut output = NamedTempFile::new().expect("creating temp file");
+
+                let builder = WriterProperties::builder();
+                let props = if multi_page {
+                    builder.set_data_page_row_count_limit(2)

Review Comment:
   👍  you are totally right -- good catch





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105045157


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -1635,27 +1740,38 @@ mod tests {
 
     #[tokio::test]
     async fn parquet_page_index_exec_metrics() {
-        let c1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(2)]));
-        let c2: ArrayRef = Arc::new(Int32Array::from(vec![Some(3), Some(4), Some(5)]));
+        let c1: ArrayRef = Arc::new(Int32Array::from(vec![
+            Some(1),
+            None,
+            Some(2),
+            Some(3),
+            Some(4),
+            Some(5),
+        ]));
         let batch1 = create_batch(vec![("int", c1.clone())]);
-        let batch2 = create_batch(vec![("int", c2.clone())]);
 
         let filter = col("int").eq(lit(4_i32));
 
         let rt = RoundTrip::new()
             .with_predicate(filter)
             .with_page_index_predicate()
-            .round_trip(vec![batch1, batch2])
+            .round_trip(vec![batch1])
             .await;
 
         let metrics = rt.parquet_exec.metrics().unwrap();
 
         // assert the batches and some metrics
+        #[rustfmt::skip]
         let expected = vec![
-            "+-----+", "| int |", "+-----+", "| 3   |", "| 4   |", "| 5   |", "+-----+",
+            "+-----+",

Review Comment:
   This is different because previously the page layout was:
   
   Page 1: 1, None, 2
   Page 2: 3, 4, 5
   
   Now the page layout is:
   
   Page 1: 1, None
   Page 2: 2, 3
   Page 3: 4, 5
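   
   To make the new layout concrete, here is a minimal standalone sketch (my own illustration, not code from this PR) of writing the six rows with a two-row data page limit, which is what produces the three pages above:
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::Int32Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::file::properties::WriterProperties;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // The same six values as the rewritten test input.
       let schema = Arc::new(Schema::new(vec![Field::new("int", DataType::Int32, true)]));
       let c1 = Int32Array::from(vec![Some(1), None, Some(2), Some(3), Some(4), Some(5)]);
       let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(c1)])?;
   
       // Start a new data page every 2 rows, so the six rows land in three
       // pages: [1, None], [2, 3], [4, 5]. Depending on the parquet crate
       // version you may also need .set_write_batch_size(2), since the
       // limit is only checked on write-batch boundaries.
       let props = WriterProperties::builder()
           .set_data_page_row_count_limit(2)
           .build();
   
       let mut buf = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props))?;
       writer.write(&batch)?;
       writer.close()?;
       Ok(())
   }
   ```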



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -223,6 +222,64 @@ impl PagePruningPredicate {
     }
 }
 
+/// Returns the column index in the row group metadata for the single
+/// column of a single column pruning predicate.
+///
+/// For example, give the predicate `y > 5`
+///
+/// And columns in the RowGroupMetadata like `['x', 'y', 'z']` will
+/// return 1.

Review Comment:
   ```suggestion
   /// For example, given the predicate `y > 5` and columns in the
   /// RowGroupMetadata like `['x', 'y', 'z']`, this will return 1.
   ```





[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.
Ted-Jiang commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105340349


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -1331,6 +1330,80 @@ mod tests {
         assert_eq!(get_value(&metrics, "pushdown_rows_filtered"), 5);
     }
 
+    #[tokio::test]
+    async fn evolved_schema_disjoint_schema_with_page_index_pushdown() {
+        let c1: ArrayRef = Arc::new(StringArray::from(vec![
+            // Page 1
+            Some("Foo"),
+            Some("Bar"),
+            // Page 2
+            Some("Foo2"),
+            Some("Bar2"),
+            // Page 3
+            Some("Foo3"),
+            Some("Bar3"),
+        ]));
+
+        let c2: ArrayRef = Arc::new(Int64Array::from(vec![
+            // Page 1:
+            Some(1),
+            Some(2),
+            // Page 2: (pruned)
+            Some(3),
+            Some(4),
+            // Page 3: (pruned)
+            Some(5),
+            None,
+        ]));
+
+        // batch1: c1(string)
+        let batch1 = create_batch(vec![("c1", c1.clone())]);
+
+        // batch2: c2(int64)
+        let batch2 = create_batch(vec![("c2", c2.clone())]);
+
+        // batch3 (has c2, c1) -- both columns, should still prune
+        let batch3 = create_batch(vec![("c1", c1.clone()), ("c2", c2.clone())]);
+
+        // batch4 (has c2, c1) -- different column order, should still prune

Review Comment:
   Nice test case 👍 





[GitHub] [arrow-datafusion] ursabot commented on pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#issuecomment-1432005892

   Benchmark runs are scheduled for baseline = d05647c65e14d865b854a845ae970797a6086e2c and contender = ec24724ea1130909402c64d5651a5815dcaf8285. ec24724ea1130909402c64d5651a5815dcaf8285 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/8994f3f58aef40acbf2920ba30cdb40a...43f6474ce1094de19bdfdd7832e62a4d/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f9197ba2ad784c9d86311279c7e5d858...7e50642b40ae43b1bfc084a4d7f4e007/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1d1733cf8528470384211aa856c63caf...0ed35f5970494164bc985fa92105e879/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ad33b31339d84f56ad24ec8b7cc42a76...dd1cd4e3a0de45089ec4fc290fbd45e8/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-datafusion] alamb commented on pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#issuecomment-1428732198

   @Ted-Jiang  and @thinkharderdev  I think I have finally fixed the bug with page index pushdown.
   
   I think @Dandandan  had asked about the status of this project as well




[GitHub] [arrow-datafusion] alamb commented on pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#issuecomment-1431991853

   > "Improved" question, does it fix those tests when setting enable_page_index to true?
   
   
   Yes! 🎉  I verified that all CI passes with page index pushdown enabled by default  (when this PR change is included). Check out https://github.com/apache/arrow-datafusion/pull/5099. I should have mentioned that. Sorry
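   
   For anyone who wants to try it out, a rough sketch of opting in (the config key and APIs are from my recollection of this era of DataFusion, and the file path, table, and predicate are made up, so double-check against your version):
   
   ```rust
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       // Enable page index pushdown via the session config.
       let config = SessionConfig::new()
           .set_bool("datafusion.execution.parquet.enable_page_index", true);
       let ctx = SessionContext::with_config(config);
   
       // Hypothetical file and filter, just to exercise page skipping.
       ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
           .await?;
       ctx.sql("SELECT * FROM t WHERE c2 > 4").await?.show().await?;
       Ok(())
   }
   ```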






[GitHub] [arrow-datafusion] Dandandan commented on pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#issuecomment-1431729379

   @alamb 
   Ah ok, I apologize, the tests are indeed not ignored.
   "Improved" question, does it fix those tests when setting `enable_page_index` to true? 
   
   ```
       physical_plan::file_format::parquet::tests::evolved_schema_disjoint_schema_filter
       physical_plan::file_format::parquet::tests::evolved_schema_disjoint_schema_with_filter_pushdown
       physical_plan::file_format::parquet::tests::evolved_schema_intersection_filter
       physical_plan::file_format::parquet::tests::evolved_schema_intersection_filter_with_filter_pushdown
   ```




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105042360


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -1635,27 +1740,38 @@ mod tests {
 
     #[tokio::test]
     async fn parquet_page_index_exec_metrics() {
-        let c1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(2)]));
-        let c2: ArrayRef = Arc::new(Int32Array::from(vec![Some(3), Some(4), Some(5)]));
+        let c1: ArrayRef = Arc::new(Int32Array::from(vec![

Review Comment:
   This was the only test that used the "merge multiple batches together" behavior of `store_parquet` -- so I rewrote the test to inline the batch creation and ensure we get evenly sized two-row pages.





[GitHub] [arrow-datafusion] Dandandan commented on pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#issuecomment-1431149838

   Can we enable `evolved_schema_disjoint_schema_filter` and other tests with this PR too?




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105014942


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -550,50 +550,63 @@ pub(crate) mod test_util {
     use parquet::file::properties::WriterProperties;
     use tempfile::NamedTempFile;
 
+    /// How many rows per page should be written
+    const ROWS_PER_PAGE: usize = 2;
+
     /// Writes `batches` to a temporary parquet file
     ///
-    /// If multi_page is set to `true`, all batches are written into
-    /// one temporary parquet file and the parquet file is written
+    /// If multi_page is set to `true`, the parquet file(s) are written
     /// with 2 rows per data page (used to test page filtering and
     /// boundaries).
     pub async fn store_parquet(
         batches: Vec<RecordBatch>,
         multi_page: bool,
     ) -> Result<(Vec<ObjectMeta>, Vec<NamedTempFile>)> {
-        if multi_page {
-            // All batches write in to one file, each batch must have same schema.
-            let mut output = NamedTempFile::new().expect("creating temp file");
-            let mut builder = WriterProperties::builder();
-            builder = builder.set_data_page_row_count_limit(2);
-            let proper = builder.build();
-            let mut writer =
-                ArrowWriter::try_new(&mut output, batches[0].schema(), Some(proper))
-                    .expect("creating writer");
-            for b in batches {
-                writer.write(&b).expect("Writing batch");
-            }
-            writer.close().unwrap();
-            Ok((vec![local_unpartitioned_file(&output)], vec![output]))
-        } else {
-            // Each batch writes to their own file
-            let files: Vec<_> = batches
-                .into_iter()
-                .map(|batch| {
-                    let mut output = NamedTempFile::new().expect("creating temp file");
+        // Each batch writes to their own file

Review Comment:
   This change is so I can actually test evolved schemas with page indexes (aka write multiple files with different schemas)
   



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -155,30 +159,25 @@ impl PagePruningPredicate {
 
         let mut row_selections = Vec::with_capacity(page_index_predicates.len());
         for predicate in page_index_predicates {
-            // `extract_page_index_push_down_predicates` only return predicate with one col.
-            //  when building `PruningPredicate`, some single column filter like `abs(i) = 1`
-            //  will be rewrite to `lit(true)`, so may have an empty required_columns.
-            let col_id =
-                if let Some(&col_id) = predicate.need_input_columns_ids().iter().next() {
-                    col_id
-                } else {
-                    continue;
-                };
+            // find column index by looking in the row group metadata.
+            let col_idx = find_column_index(predicate, &groups[0]);

Review Comment:
   This calls the new per-file column index logic. I considered rearranging this code more substantially (for example, having it do the column index lookup in the pruning stats), but I felt this way was easiest to review and likely the most performant as well.



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -233,25 +233,18 @@ impl PruningPredicate {
             .unwrap_or_default()
     }
 
-    /// Returns all need column indexes to evaluate this pruning predicate
-    pub(crate) fn need_input_columns_ids(&self) -> HashSet<usize> {
-        let mut set = HashSet::new();
-        self.required_columns.columns.iter().for_each(|x| {
-            match self.schema().column_with_name(x.0.name.as_str()) {
-                None => {}
-                Some(y) => {
-                    set.insert(y.0);
-                }
-            }
-        });
-        set
+    pub(crate) fn required_columns(&self) -> &RequiredStatColumns {

Review Comment:
   This is the core change -- `need_input_columns_ids` returns indexes in terms of the overall table schema. If an individual parquet file does not have all the columns, or has them in a different order, those indexes are not correct.
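   
   As a rough illustration of the per-file lookup (a simplified sketch of the idea, not the PR's actual `find_column_index`, which also handles the single-column pruning predicate machinery):
   
   ```rust
   use parquet::file::metadata::RowGroupMetaData;
   
   /// Resolve the physical column index for `column_name` by scanning the
   /// row group metadata of the specific file being read, instead of
   /// trusting an index computed from the overall table schema.
   fn find_column_index_by_name(
       column_name: &str,
       row_group: &RowGroupMetaData,
   ) -> Option<usize> {
       row_group
           .columns()
           .iter()
           // `column_path().string()` is the dotted path of the leaf
           // column, e.g. "c2"; nested columns would need more care than
           // this simple string comparison.
           .position(|col| col.column_path().string() == column_name)
   }
   ```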





[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.
Ted-Jiang commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105328492


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -550,50 +550,63 @@ pub(crate) mod test_util {
     use parquet::file::properties::WriterProperties;
     use tempfile::NamedTempFile;
 
+    /// How many rows per page should be written
+    const ROWS_PER_PAGE: usize = 2;
+
     /// Writes `batches` to a temporary parquet file
     ///
-    /// If multi_page is set to `true`, all batches are written into
-    /// one temporary parquet file and the parquet file is written
+    /// If multi_page is set to `true`, the parquet file(s) are written
     /// with 2 rows per data page (used to test page filtering and
     /// boundaries).
     pub async fn store_parquet(
         batches: Vec<RecordBatch>,
         multi_page: bool,
     ) -> Result<(Vec<ObjectMeta>, Vec<NamedTempFile>)> {
-        if multi_page {
-            // All batches write in to one file, each batch must have same schema.
-            let mut output = NamedTempFile::new().expect("creating temp file");
-            let mut builder = WriterProperties::builder();
-            builder = builder.set_data_page_row_count_limit(2);
-            let proper = builder.build();
-            let mut writer =
-                ArrowWriter::try_new(&mut output, batches[0].schema(), Some(proper))
-                    .expect("creating writer");
-            for b in batches {
-                writer.write(&b).expect("Writing batch");
-            }
-            writer.close().unwrap();
-            Ok((vec![local_unpartitioned_file(&output)], vec![output]))
-        } else {
-            // Each batch writes to their own file
-            let files: Vec<_> = batches
-                .into_iter()
-                .map(|batch| {
-                    let mut output = NamedTempFile::new().expect("creating temp file");
+        // Each batch writes to their own file
+        let files: Vec<_> = batches
+            .into_iter()
+            .map(|batch| {
+                let mut output = NamedTempFile::new().expect("creating temp file");
+
+                let builder = WriterProperties::builder();
+                let props = if multi_page {
+                    builder.set_data_page_row_count_limit(2)

Review Comment:
   ```suggestion
                       builder.set_data_page_row_count_limit(ROWS_PER_PAGE)
   ```
   I think this should keep using `ROWS_PER_PAGE`.
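   
   For illustration, the property construction with the constant applied might look like this (a simplified standalone sketch of the helper under review, not the exact code):
   
   ```rust
   use parquet::file::properties::WriterProperties;
   
   /// How many rows per page should be written
   const ROWS_PER_PAGE: usize = 2;
   
   fn writer_props(multi_page: bool) -> WriterProperties {
       let builder = WriterProperties::builder();
       let builder = if multi_page {
           // Use the named constant rather than a bare `2`.
           builder.set_data_page_row_count_limit(ROWS_PER_PAGE)
       } else {
           builder
       };
       builder.build()
   }
   ```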





[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.
Ted-Jiang commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105333170


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -233,25 +233,18 @@ impl PruningPredicate {
             .unwrap_or_default()
     }
 
-    /// Returns all need column indexes to evaluate this pruning predicate
-    pub(crate) fn need_input_columns_ids(&self) -> HashSet<usize> {
-        let mut set = HashSet::new();
-        self.required_columns.columns.iter().for_each(|x| {
-            match self.schema().column_with_name(x.0.name.as_str()) {
-                None => {}
-                Some(y) => {
-                    set.insert(y.0);
-                }
-            }
-        });
-        set
+    pub(crate) fn required_columns(&self) -> &RequiredStatColumns {

Review Comment:
   Thanks for the explanation! 👍
   >  If an individual parquet file does not have all the columns or has the columns in a different order
   
   I have a question: given `file_a (c1, c2)` and `file_b(c1, c3)`, does DataFusion support `create external table t(c1)` over both file_a and file_b? 🤔





[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.
Ted-Jiang commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105334626


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -233,25 +233,18 @@ impl PruningPredicate {
             .unwrap_or_default()
     }
 
-    /// Returns all need column indexes to evaluate this pruning predicate
-    pub(crate) fn need_input_columns_ids(&self) -> HashSet<usize> {
-        let mut set = HashSet::new();
-        self.required_columns.columns.iter().for_each(|x| {
-            match self.schema().column_with_name(x.0.name.as_str()) {
-                None => {}
-                Some(y) => {
-                    set.insert(y.0);
-                }
-            }
-        });
-        set
+    pub(crate) fn required_columns(&self) -> &RequiredStatColumns {

Review Comment:
   And given `file_a (c1, c2)` and `file_b(c3)`, is `create external table t(c1)` supported?
   Do both parquet files have to contain the metadata for c1?





[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.
Ted-Jiang commented on code in PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#discussion_r1105338488


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -233,25 +233,18 @@ impl PruningPredicate {
             .unwrap_or_default()
     }
 
-    /// Returns all need column indexes to evaluate this pruning predicate
-    pub(crate) fn need_input_columns_ids(&self) -> HashSet<usize> {
-        let mut set = HashSet::new();
-        self.required_columns.columns.iter().for_each(|x| {
-            match self.schema().column_with_name(x.0.name.as_str()) {
-                None => {}
-                Some(y) => {
-                    set.insert(y.0);
-                }
-            }
-        });
-        set
+    pub(crate) fn required_columns(&self) -> &RequiredStatColumns {

Review Comment:
   I see both situations are supported in the tests below 😆





[GitHub] [arrow-datafusion] alamb commented on pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268#issuecomment-1431523229

   @Dandandan, I apologize,  but I don't understand this request
   
   >  Can we un-ignore evolved_schema_disjoint_schema_filter and other tests with this PR too?
   
   That test does not appear to be ignored on master (nor in this PR): 
   
   https://github.com/apache/arrow-datafusion/blob/07ae738b48b6001498a4b10e9750422f9d8e1b7f/datafusion/core/src/physical_plan/file_format/parquet.rs#L1257-L1259
   
   I only found one instance of this test in the repo: 
   https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20evolved_schema_disjoint_schema_filter&type=code
   
   I couldn't find any other `#[ignore]`'d test elsewhere in the repo that looked relevant.
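   
   (For reference, one quick way to double-check locally, assuming a checkout of the repository, is `git grep -n '#\[ignore\]' datafusion/` from the repo root.)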




[GitHub] [arrow-datafusion] alamb merged pull request #5268: Support page skipping / page_index pushdown for evolved schemas

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5268:
URL: https://github.com/apache/arrow-datafusion/pull/5268

