You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/24 15:26:47 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Ted-Jiang opened a new pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   Thanks for @yjshen give this advice ❤️. 
   Closes #2073.
   
    # Rationale for this change
   
   Before:
   ```
   ❯  explain select c1, c2 from test where  c2 = 0.000001;
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                                 |
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #test.c1, #test.c2                                                                                                                       |
   |               |   Filter: #test.c2 = Float64(0.000001)                                                                                                               |
   |               |     TableScan: test projection=Some([0, 1]), partial_filters=[#test.c2 = Float64(0.000001)]                                                          |
   | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                                        |
   |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                        |
   |               |     FilterExec: CAST(c2@1 AS Float64) = 0.000001                                                                                                     |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                              |
   |               |         CsvExec: files=[/Users/yangjiang/CLionProjects/github/arrow-datafusion/testing/data/csv/aggregate_test_100.csv], has_header=true, limit=None |
   |               |                                                                                                                                                      |
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.004 seconds.
   ```
   
   Now:
   ```
   explain select l_orderkey, l_shipdate from parquet where l_shipdate < '1996-05-20';
   +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                                                                                                                                                           |
   +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #parquet.l_orderkey, #parquet.l_shipdate                                                                                                                                                                                                                           |
   |               |   Filter: #parquet.l_shipdate < Utf8("1996-05-20")                                                                                                                                                                                                                             |
   |               |     TableScan: parquet projection=Some([0, 10]), partial_filters=[#parquet.l_shipdate < Utf8("1996-05-20")]                                                                                                                                                                    |
   | physical_plan | ProjectionExec: expr=[l_orderkey@0 as l_orderkey, l_shipdate@1 as l_shipdate]                                                                                                                                                                                                  |
   |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                                  |
   |               |     FilterExec: l_shipdate@1 < CAST(1996-05-20 AS Date32)                                                                                                                                                                                                                      |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                                                                                                                        |
   |               |         ParquetExec: limit=None, partitions=[/Users/yangjiang/test-data/tpch-1g-oneFile/lineitem/part-00000-41937a05-669a-4bfc-abbd-b4cdf90557f1-c000.snappy.parquet], pruning_predicate=l_shipdate_min@0 < CAST(1996-05-20 AS Date32), projected_col=[l_orderkey, l_shipdate] |
   |               |                                                                                                                                                                                                                                                                                |
   +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.008 seconds.
   
   
   
   explain select c1,c2 from csv where c2 < 10;
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                                                         |
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #csv.c1, #csv.c2                                                                                                                                                 |
   |               |   Filter: #csv.c2 < Int64(10)                                                                                                                                                |
   |               |     TableScan: csv projection=Some([0, 1]), partial_filters=[#csv.c2 < Int64(10)]                                                                                            |
   | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                                                                |
   |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                |
   |               |     FilterExec: CAST(c2@1 AS Int64) < 10                                                                                                                                     |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                      |
   |               |         CsvExec: files=[/Users/yangjiang/CLionProjects/github/arrow-datafusion/testing/data/csv/aggregate_test_100.csv], has_header=true, limit=None, projected_col=[c1, c2] |
   |               |                                                                                                                                                                              |
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.005 seconds.
   
   
   
   
   
   
   
   
   
   ```
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang closed pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Posted by GitBox <gi...@apache.org>.
Ted-Jiang closed pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 merged pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Posted by GitBox <gi...@apache.org>.
xudong963 merged pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077#discussion_r834460580



##########
File path: datafusion/src/physical_plan/file_format/parquet.rs
##########
@@ -274,12 +274,24 @@ impl ExecutionPlan for ParquetExec {
     ) -> std::fmt::Result {
         match t {
             DisplayFormatType::Default => {
-                write!(
-                    f,
-                    "ParquetExec: limit={:?}, partitions={}",
-                    self.base_config.limit,
-                    super::FileGroupsDisplay(&self.base_config.file_groups)
-                )
+                if let Some(pre) = &self.pruning_predicate {
+                    write!(
+                        f,
+                        "ParquetExec: limit={:?}, partitions={}, pruning_predicate={}, projected_col={}",

Review comment:
       How do you like `projection` and `predicate` to shorten it? Besides, I suggest we put `project` and `predicate` before `partitions` since it might be a long list for partitions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on a change in pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on a change in pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077#discussion_r834469614



##########
File path: datafusion/src/physical_plan/file_format/parquet.rs
##########
@@ -274,12 +274,24 @@ impl ExecutionPlan for ParquetExec {
     ) -> std::fmt::Result {
         match t {
             DisplayFormatType::Default => {
-                write!(
-                    f,
-                    "ParquetExec: limit={:?}, partitions={}",
-                    self.base_config.limit,
-                    super::FileGroupsDisplay(&self.base_config.file_groups)
-                )
+                if let Some(pre) = &self.pruning_predicate {
+                    write!(
+                        f,
+                        "ParquetExec: limit={:?}, partitions={}, pruning_predicate={}, projected_col={}",

Review comment:
       Sure, I wasn't sure about the name at first👍




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on a change in pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on a change in pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077#discussion_r834438207



##########
File path: datafusion/tests/user_defined_plan.rs
##########
@@ -225,7 +225,9 @@ async fn topk_plan() -> Result<()> {
     let actual_output = exec_sql(&mut ctx, &explain_query).await?;
 
     // normalize newlines (output on windows uses \r\n)
-    let actual_output = actual_output.replace("\r\n", "\n");
+    let mut actual_output = actual_output.replace("\r\n", "\n");
+    actual_output.retain(|x| !x.is_ascii_whitespace());

Review comment:
       Everytime changing explain struct will add whitespace in output.
   Add this filter will make this test pass.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #2077: Fix lost filters and projections in ParquetExec, CSVExec etc

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #2077:
URL: https://github.com/apache/arrow-datafusion/pull/2077#discussion_r834591154



##########
File path: datafusion/src/physical_plan/file_format/parquet.rs
##########
@@ -274,12 +274,24 @@ impl ExecutionPlan for ParquetExec {
     ) -> std::fmt::Result {
         match t {
             DisplayFormatType::Default => {
-                write!(
-                    f,
-                    "ParquetExec: limit={:?}, partitions={}",
-                    self.base_config.limit,
-                    super::FileGroupsDisplay(&self.base_config.file_groups)
-                )
+                if let Some(pre) = &self.pruning_predicate {
+                    write!(
+                        f,
+                        "ParquetExec: limit={:?}, partitions={}, pruning_predicate={}, projected_col={}",

Review comment:
       I also like `projection` and `predicate`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org