Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/05/02 21:18:40 UTC
[GitHub] [arrow-datafusion] alamb opened a new issue, #6194: Explain plan does not always show ordering
alamb opened a new issue, #6194:
URL: https://github.com/apache/arrow-datafusion/issues/6194
### Is your feature request related to a problem or challenge?
When debugging something downstream, I was quite confused by the following:
Make input:
```shell
echo "x,y" > /tmp/test.csv
echo "a,1" >> /tmp/test.csv
echo "a,2" >> /tmp/test.csv
echo "b,3" >> /tmp/test.csv
```
Run in `datafusion-cli`:
Create table:
```sql
DROP TABLE IF EXISTS test ;
CREATE EXTERNAL TABLE test(x varchar, y bigint)
STORED AS CSV
WITH HEADER ROW
WITH ORDER (x ASC)
LOCATION '/tmp/test.csv'
;
```
Then run a query:
```
❯ explain select * from test order by x ASC;
+---------------+----------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------+
| logical_plan | Sort: test.x ASC NULLS LAST |
| | TableScan: test projection=[x, y] |
| physical_plan | CsvExec: files={1 group: [[private/tmp/test.csv]]}, has_header=true, limit=None, projection=[x, y] |
| | |
+---------------+----------------------------------------------------------------------------------------------------+
```
Note that `CsvExec` does *NOT* show `output_ordering` in the plan, even though the optimizer has used it (there is no sort in the physical plan).
Here is an example of a plan over a parquet file that does show `output_ordering=[tag0@0 ASC, time@1 ASC]`:
```
2023-05-02T13:39:55.659173Z TRACE datafusion::physical_plan::planner: Optimized physical plan by parquet_sortness:
SortExec: expr=[iox::measurement@0 ASC NULLS LAST,key@1 ASC NULLS LAST,value@2 ASC NULLS LAST]
ProjectionExec: expr=[select_test as iox::measurement, tag0 as key, tag0@0 as value]
AggregateExec: mode=FinalPartitioned, gby=[tag0@0 as tag0], aggr=[], ordering_mode=FullyOrdered
AggregateExec: mode=Partial, gby=[tag0@0 as tag0], aggr=[], ordering_mode=FullyOrdered
UnionExec
ProjectionExec: expr=[tag0@0 as tag0]
FilterExec: time@1 >= 631152000000000000
ParquetExec: limit=None, partitions={1 group: [[1/1/1/3a820ed1-c0a1-468d-b4de-edd49f2fef50.parquet]]}, predicate=time@12 >= 631152000000000000, pruning_predicate=time_max@0 >= 631152000000000000, output_ordering=[tag0@0 ASC, time@1 ASC], projection=[tag0, time]
```
### Describe the solution you'd like
I would like all the listing tables (e.g. `CsvExec`, `AvroExec`, `JsonExec`, etc.) to have a `fmt_as` that includes `output_ordering` when it has one.
Here is the relevant part in `CsvExec`:
https://github.com/apache/arrow-datafusion/blob/cda00b545e1b4492269f76f65545c82264f79b88/datafusion/core/src/physical_plan/file_format/csv.rs#L166-L183
### Describe alternatives you've considered
The simple solution would be to copy the code from `ParquetExec` in https://github.com/apache/arrow-datafusion/blob/cda00b545e1b4492269f76f65545c82264f79b88/datafusion/core/src/physical_plan/file_format/parquet.rs#L422-L435
The (better) solution would be to make a generic way to format the `base_config` field that is used across all of the executors.
```rust
base_config: FileScanConfig,
```
A generic solution would be better, as it would be far more likely to remain in sync if additional fields are added.
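As a rough illustration of the generic approach, here is a minimal sketch in which the shared config implements `Display` once and each executor delegates to it. `ScanConfig`, `SortExpr`, and `describe_csv_exec` are hypothetical, simplified stand-ins invented for this example; they are not the real `FileScanConfig` or `fmt_as` signatures, and the exact field order in the output is an assumption.

```rust
use std::fmt;

/// Hypothetical stand-in for one sort expression, rendered like `x@0 ASC`.
struct SortExpr {
    column: String,
    index: usize,
    descending: bool,
}

impl fmt::Display for SortExpr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let direction = if self.descending { "DESC" } else { "ASC" };
        write!(f, "{}@{} {}", self.column, self.index, direction)
    }
}

/// Hypothetical, simplified stand-in for the shared `base_config` type.
struct ScanConfig {
    files: Vec<String>,
    limit: Option<usize>,
    projection: Vec<String>,
    output_ordering: Option<Vec<SortExpr>>,
}

impl fmt::Display for ScanConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "files=[{}], limit={:?}", self.files.join(", "), self.limit)?;
        // Only print the ordering when the scan actually declares one.
        if let Some(ordering) = &self.output_ordering {
            let exprs: Vec<String> = ordering.iter().map(|e| e.to_string()).collect();
            write!(f, ", output_ordering=[{}]", exprs.join(", "))?;
        }
        write!(f, ", projection=[{}]", self.projection.join(", "))
    }
}

/// Each executor formats the shared config the same way and then appends
/// only its own format-specific fields.
fn describe_csv_exec(config: &ScanConfig, has_header: bool) -> String {
    format!("CsvExec: {}, has_header={}", config, has_header)
}
```

With this shape, a field added to the shared config only needs to be handled in one `Display` impl, rather than in every executor's formatting code.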
### Additional context
I think this is a good first issue, as it is a relatively straightforward coding exercise (and test output update exercise) that would help someone understand the codebase.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #6194: Explain plan does not always show ordering
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6194: Explain plan does not always show ordering
URL: https://github.com/apache/arrow-datafusion/issues/6194