Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/20 16:53:20 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue, #2293: Single File Per ParquetExec, AvroExec, etc...

tustvold opened a new issue, #2293:
URL: https://github.com/apache/arrow-datafusion/issues/2293

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Part of #2079

Following on from #2292 and #2291 it should be possible to pull the multi-file handling out of each individual file operator, and delegate it to the physical plan. As described in #2079 this will greatly simplify the implementations, whilst also hiding fewer details from the physical plan.

**Describe the solution you'd like**

Currently a `FileScanConfig` would result in `ListingTable::scan` generating a physical plan that looks something like

```
ParquetExec
```

I propose instead generating something like

```
UnionExec
  ProjectionExec: ... // Partition 1
    SchemaAdapterExec
      ParquetExec: ... // Partition 1 File 1
    SchemaAdapterExec
      ParquetExec: ... // Partition 1 File 2
  ProjectionExec: ... // Partition 2
    SchemaAdapterExec
      ParquetExec: ... // Partition 2 File 1
    SchemaAdapterExec
      ParquetExec: ... // Partition 2 File 2
    SchemaAdapterExec
      ParquetExec: ... // Partition 2 File 3
```

Whilst this plan is more complex, it reduces the complexity of the file format operators, and should hopefully lead to fewer bugs like #2170 or #2000.
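
As a rough illustration of the plan shape proposed above, here is a minimal, self-contained sketch that builds and prints such a tree from a list of file groups. Everything in it is hypothetical: `PlanNode`, `plan_for_file_groups` and `render` are toy stand-ins, not DataFusion types or APIs, and the projection node is given several inputs only to mirror the outline above.

```rust
/// Toy stand-in for a physical plan node. This is NOT DataFusion's
/// `ExecutionPlan`; it only models the shape of the proposed plan.
#[derive(Debug, Clone, PartialEq)]
pub enum PlanNode {
    Union(Vec<PlanNode>),
    /// One projection per output partition (file group).
    Projection { partition: usize, inputs: Vec<PlanNode> },
    SchemaAdapter(Box<PlanNode>),
    /// A scan that reads exactly one file.
    FileScan { path: String },
}

/// Hypothetical free function: turn file groups into the nested plan,
/// with one single-file scan (wrapped in a schema adapter) per file.
pub fn plan_for_file_groups(file_groups: &[Vec<String>]) -> PlanNode {
    let partitions = file_groups
        .iter()
        .enumerate()
        .map(|(idx, files)| PlanNode::Projection {
            partition: idx + 1,
            inputs: files
                .iter()
                .map(|path| {
                    PlanNode::SchemaAdapter(Box::new(PlanNode::FileScan {
                        path: path.clone(),
                    }))
                })
                .collect(),
        })
        .collect();
    PlanNode::Union(partitions)
}

/// Render the tree as text, two spaces of indentation per level.
pub fn render(node: &PlanNode) -> String {
    let mut out = String::new();
    render_into(node, 0, &mut out);
    out
}

fn render_into(node: &PlanNode, depth: usize, out: &mut String) {
    let indent = "  ".repeat(depth);
    match node {
        PlanNode::Union(children) => {
            out.push_str(&format!("{}UnionExec\n", indent));
            for child in children {
                render_into(child, depth + 1, out);
            }
        }
        PlanNode::Projection { partition, inputs } => {
            out.push_str(&format!("{}ProjectionExec: partition {}\n", indent, partition));
            for child in inputs {
                render_into(child, depth + 1, out);
            }
        }
        PlanNode::SchemaAdapter(child) => {
            out.push_str(&format!("{}SchemaAdapterExec\n", indent));
            render_into(child, depth + 1, out);
        }
        PlanNode::FileScan { path } => {
            out.push_str(&format!("{}ParquetExec: {}\n", indent, path));
        }
    }
}
```

The point of the sketch is that the leaf scan knows about a single file only; grouping, schema adaptation and projection all live in separate nodes of the tree.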

**Describe alternatives you've considered**

We could not do this

FYI @thinkharderdev @matthewmturner

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on issue #2293: Single File Per ParquetExec, AvroExec, etc...

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2293:
URL: https://github.com/apache/arrow-datafusion/issues/2293#issuecomment-1104960286

   > Do you mean PartitionedFile for File, but removing the partition_values field?
   
   Yes, although removing the partition_values is likely follow up work
   
   > in ParquetExec's try_new method, or some related place in the physical plan?
   
   I would rather keep the translation logic out of the file format specific operators, but having a free function that can be called by `ListingTable` and potentially other things, such as your Spark translation layer, seems perfectly sensible to me. I just care about reducing the amount of smarts in the individual file format specific operators :smile: 
   



[GitHub] [arrow-datafusion] yjshen commented on issue #2293: Single File Per ParquetExec, AvroExec, etc...

Posted by GitBox <gi...@apache.org>.

yjshen commented on issue #2293:
URL: https://github.com/apache/arrow-datafusion/issues/2293#issuecomment-1105001136

   >  having a free function that can be called by ListingTable and potentially other things, such as your Spark translation layer, seems perfectly sensible to me
   
   Sounds great!



[GitHub] [arrow-datafusion] yjshen commented on issue #2293: Single File Per ParquetExec, AvroExec, etc...

Posted by GitBox <gi...@apache.org>.

yjshen commented on issue #2293:
URL: https://github.com/apache/arrow-datafusion/issues/2293#issuecomment-1104652888

   Do you mean `PartitionedFile` for File?
   ```rust
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       /// An optional file range for a more fine-grained parallel execution
       pub range: Option<FileRange>,
   }
   ```
   
   I'm okay with the change, but since we are directly translating Spark physical plans into DataFusion physical plans, is it possible to do this `Original ParquetExec -> UnionExec(SchemaAdapterExec(New ParquetExec))` translation in `ParquetExec`'s `try_new` method, or some related place in the physical plan?
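
For illustration only, the translation in question could be sketched as a rewrite over an already-built plan tree. The `Plan` enum and `expand_multi_file_scans` below are hypothetical toy definitions, not DataFusion types, and the rewrite is written as a standalone function rather than as part of any operator's constructor.

```rust
/// Toy plan node for illustration only; these are not DataFusion types.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// "Original" scan that handles several files itself.
    MultiFileScan { files: Vec<String> },
    /// "New" scan that reads exactly one file.
    SingleFileScan { file: String },
    SchemaAdapter(Box<Plan>),
    Union(Vec<Plan>),
    /// Any other operator with a single input, e.g. a filter.
    Other { name: String, input: Box<Plan> },
}

/// Hypothetical free function: replace every multi-file scan with
/// Union(SchemaAdapter(SingleFileScan)) and leave other operators as they are.
pub fn expand_multi_file_scans(plan: Plan) -> Plan {
    match plan {
        Plan::MultiFileScan { files } => Plan::Union(
            files
                .into_iter()
                .map(|file| Plan::SchemaAdapter(Box::new(Plan::SingleFileScan { file })))
                .collect(),
        ),
        // Single-file scans are already in the target form.
        Plan::SingleFileScan { file } => Plan::SingleFileScan { file },
        // For every other node, recurse into the inputs.
        Plan::SchemaAdapter(input) => {
            Plan::SchemaAdapter(Box::new(expand_multi_file_scans(*input)))
        }
        Plan::Union(inputs) => {
            Plan::Union(inputs.into_iter().map(expand_multi_file_scans).collect())
        }
        Plan::Other { name, input } => Plan::Other {
            name,
            input: Box::new(expand_multi_file_scans(*input)),
        },
    }
}
```

Because the rewrite is an ordinary function over the plan, any caller that produces a plan with multi-file scans could apply it, without the scan operator itself needing to know about the expansion.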



[GitHub] [arrow-datafusion] thinkharderdev commented on issue #2293: Single File Per ParquetExec, AvroExec, etc...

Posted by GitBox <gi...@apache.org>.

thinkharderdev commented on issue #2293:
URL: https://github.com/apache/arrow-datafusion/issues/2293#issuecomment-1108850307

   This sounds like a great idea. The serial file processing in `ParquetExec` is currently a pretty nasty bottleneck for queries over large numbers of parquet files.  
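
To make the bottleneck concrete, here is a toy comparison of one operator walking its files in sequence versus one independent scan per file. It is only a sketch under loud assumptions: `scan_file`, `scan_serial` and `scan_parallel` are hypothetical, the "files" are in-memory strings, and plain `std::thread` is used for brevity, which is not how DataFusion schedules work.

```rust
use std::thread;

/// Stand-in for decoding one file; here it just counts the bytes of the
/// in-memory "file" contents.
fn scan_file(contents: &str) -> usize {
    contents.len()
}

/// One operator that processes its files one after another.
pub fn scan_serial(files: &[&str]) -> usize {
    files.iter().map(|f| scan_file(f)).sum()
}

/// One single-file scan per thread; the results are combined afterwards.
pub fn scan_parallel(files: &[&str]) -> usize {
    thread::scope(|scope| {
        // Collecting eagerly spawns every scan before any of them is joined.
        let handles: Vec<_> = files
            .iter()
            .map(|f| scope.spawn(move || scan_file(f)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("scan thread panicked"))
            .sum::<usize>()
    })
}
```

Both functions return the same total; the difference is only that the per-file scans in the second version do not have to wait for each other.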

