You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/05/02 21:18:40 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #6194: Explain plan does not always show ordering

alamb opened a new issue, #6194:
URL: https://github.com/apache/arrow-datafusion/issues/6194

   ### Is your feature request related to a problem or challenge?
   
   When debugging something downstream, I was quite confused by the following:
   
   Make input:
   ```shell
   echo "x,y" > /tmp/test.csv
   echo "a,1" >> /tmp/test.csv
   echo "a,2" >> /tmp/test.csv
   echo "b,3" >> /tmp/test.csv
   ```
   
   Run in `datafusion-cli`:
   
   Create table:
   
   ```sql
   DROP TABLE IF EXISTS test ;
   
   CREATE EXTERNAL TABLE test(x varchar, y bigint)
   STORED AS CSV
   WITH HEADER ROW
   WITH ORDER (x  ASC)
   LOCATION '/tmp/test.csv'
   ;
   ```
   
   Then run a query:
   
   ```
   ❯ explain select * from test order by x ASC;
   +---------------+----------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                               |
   +---------------+----------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: test.x ASC NULLS LAST                                                                        |
   |               |   TableScan: test projection=[x, y]                                                                |
   | physical_plan | CsvExec: files={1 group: [[private/tmp/test.csv]]}, has_header=true, limit=None, projection=[x, y] |
   |               |                                                                                                    |
   +---------------+----------------------------------------------------------------------------------------------------+
   ```
   
   Note that the CSV exec does *NOT* show `output_ordering` in the plan but the optimizer has used it (there is no sort in the actual plan)
   
   Here is an example of a parquet file showing `output_ordering=[tag0@0 ASC, time@1 ASC]` 
   
   ```
   2023-05-02T13:39:55.659173Z TRACE datafusion::physical_plan::planner: Optimized physical plan by parquet_sortness:
   SortExec: expr=[iox::measurement@0 ASC NULLS LAST,key@1 ASC NULLS LAST,value@2 ASC NULLS LAST]
     ProjectionExec: expr=[select_test as iox::measurement, tag0 as key, tag0@0 as value]
       AggregateExec: mode=FinalPartitioned, gby=[tag0@0 as tag0], aggr=[], ordering_mode=FullyOrdered
         AggregateExec: mode=Partial, gby=[tag0@0 as tag0], aggr=[], ordering_mode=FullyOrdered
           UnionExec
             ProjectionExec: expr=[tag0@0 as tag0]
               FilterExec: time@1 >= 631152000000000000
                 ParquetExec: limit=None, partitions={1 group: [[1/1/1/3a820ed1-c0a1-468d-b4de-edd49f2fef50.parquet]]}, predicate=time@12 >= 631152000000000000, pruning_predicate=time_max@0 >= 631152000000000000, output_ordering=[tag0@0 ASC, time@1 ASC], projection=[tag0, time]
   ```
   
   
   ### Describe the solution you'd like
   
   I would like all the listing tables (e.g.  `CsvExec`, `AvroExec`, `JsonExec`, etc to have `fmt_as` that include `output_ordering` when it has one
   
   Here is the relevant part in `CsvExec`:
   
   https://github.com/apache/arrow-datafusion/blob/cda00b545e1b4492269f76f65545c82264f79b88/datafusion/core/src/physical_plan/file_format/csv.rs#L166-L183
   
   
   
   ### Describe alternatives you've considered
   
   
   The simple solution would be to copy the code from `ParquetExec` in https://github.com/apache/arrow-datafusion/blob/cda00b545e1b4492269f76f65545c82264f79b88/datafusion/core/src/physical_plan/file_format/parquet.rs#L422-L435
   
   The (better) solution would be to make a generic way to format the `base_config` field that is used across all of the executors. 
   
   ```rust
       base_config: FileScanConfig,
   ```
   
   A generic solution would be better as it would be far more likely to remain in sync if additional fields are added
   
   ### Additional context
   
   I think this is a good first issue as it is a relatively straightforward coding exercise (and test output update exercise) that would help someone understand the codebase


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #6194: Explain plan does not always show ordering

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6194: Explain plan does not always show ordering
URL: https://github.com/apache/arrow-datafusion/issues/6194


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org