You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/10 04:04:28 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

seddonm1 opened a new pull request #9976:
URL: https://github.com/apache/arrow/pull/9976

@alamb @jorgecarleitao @andygrove

This is a WIP start of work to re-implement the `input_file_name` function by not using the `metadata` of the `RecordBatch` and instead relying on plan traversal. Inspired by @jorgecarleitao's comment: https://github.com/apache/arrow/pull/9944#pullrequestreview-632013179 I have tried to initially extend the `TableProvider` to capture this information by means of the `Statistics`.

This code collects statistics at the **partition** level not the **table** level and provides methods for calculating the table level statistics. It can easily be extended to do things like `TableProvider::file_name(partition: usize)` to get a specific `partition` filename. Calculating things like `distinct_count` per column will be a fun future problem and may require something like `HyperLogLog`.

This code also assumes that a 1:1 mapping of Parquet file to logical partition - but it still has a lot of the initial implementation for future many:one mapping of files to partition. I think it would be fair to fix a 1:1 mapping of file:partition and then down the line a `repartition` operator used to merge them. If agreed I can make those changes.

I don't like to make these changes in a vacuum and the mailing list is not ideal to demonstrate ideas so I hoped to review here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-831243862


   https://github.com/apache/arrow/pull/10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process
   
   Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories. 
   
   Thank you for understanding and helping to make arrow-rs and datafusion better


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-817111476


   @seddonm1  can you remind me again why adding a non-persisted field to each Rust `RecordBatch` (for example `RecordBatch::runtime_exension`) was rejected. I can't help thinking that implementing an accurate `input_filename` with all the data reorganization that can happen during plan time is going to require runtime (not just plan time) support.
   
   Sorry for muddying the waters,


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-817243506


   Isn't the filename that a column came from uniquely identified by the logical plan? If two physical plans arrive to different conclusions about a columns' provenance, then those physical plans are using two data origins, which implies different semantics.
   
   This is rationale I was using to recommend addressing this at the DAG level. I do agree that the logical plan may not have all the information about how the source is partitioned and its exact names (as that may even change with time), but I would expect to resolve that as part of the query execution (just like we resolve the physical plan when we run `SHOW PLAN`).
   
   @seddonm1 , I think that a physical expression does not need to be `ScalarFunctionExpr`: the `ScalarFunctionExpr` is useful for the cases where the physical operation can be described by a simple function and signature. Check e.g. how e.g. the binary operators are defined: they have their own custom `struct` that implements `PhysicalExpr`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb closed pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

alamb closed pull request #9976:
URL: https://github.com/apache/arrow/pull/9976


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-822358694


   The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories [arrow-rs](https://github.com/apache/arrow-rs) and [arrow-datafusion](https://github.com/apache/arrow-datafusion). It is likely we will not merge this PR into this repository
   
   Please see the [mailing-list](https://lists.apache.org/thread.html/rce7c61cb3f703367dc00984530016e3fcb828e5a8035655fbcccf862%40%3Cdev.arrow.apache.org%3E) thread for more details
   
   We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#discussion_r611024130



##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -271,62 +249,18 @@ impl ParquetExec {
                 .collect(),
         );
 
-        // sum the statistics
-        let mut num_rows: Option<usize> = None;

Review comment:
       I really like how you have consolidated the statistics accumulation / cross partition summarization into a central place.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-817237604


   @jorgecarleitao 
   So my challenge is this function signature:
   
   ```rust
   /// Scalar function
   pub type ScalarFunctionImplementation = Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;
   ```
   
   which is required for the physical expression:
   
   ```rust
   /// Physical expression of a scalar function
   pub struct ScalarFunctionExpr {
       fun: ScalarFunctionImplementation,
       name: String,
       args: Vec<Arc<dyn PhysicalExpr>>,
       return_type: DataType,
   }
   ```
   
   I can either change this type alias to accept additional parameters (in this case I could also pass the `LogicalPlan`) which would require changes to all `BuiltinScalarFunction`s or I have to create a second implementation of `ScalarFunctionExpr` like `ScalarFunctionPlanExpr` which passes in the plan with the modified type signature. Both are do-able but have drawbacks.
   
   Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-817073814


   https://issues.apache.org/jira/browse/ARROW-12290


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #9976: ARROW-12290: [Rust][DataFusion] Add input_file_name function [WIP]

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #9976:
URL: https://github.com/apache/arrow/pull/9976#issuecomment-817216973


   @alamb Thanks for the review and I wanted to raise this early as I appreciate the feedback (even if we end up closing this PR).
   
   The `RecordBatch` metadata is definitely the easiest mechanism for attaching this data but I do agree with Jorge that the modifications to the Arrow crate are undesirable.
   
   As this is ultimately a lineage type capability (by traversing the plan) I have checked how Spark implements it and it will throw an error: `'input_file_name' does not support more than one sources` for queries not against the simple use case (i.e. direct select against a `TableProvider`) - for example if trying to execute it in a SQL query with multiple tables.
   
   I will have a go trying to make basic functionality work to ensure I'm not off on a wild goose chase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org