Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/08 02:30:13 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #9944: ARROW-12290: [Rust][DataFusion] Add input_file_name function

seddonm1 opened a new pull request #9944:
URL: https://github.com/apache/arrow/pull/9944


   For lineage and diffing purposes (used by protocols like DeltaLake), it can be useful to know the source of the input data for a DataFrame. This adds the `input_file_name` function which, like its Spark counterpart, returns the name of the file being read, or NULL if that is not available.
   
   Unfortunately, an Arrow RecordBatch cannot serialise this information correctly, so it is available at runtime only. See: https://lists.apache.org/thread.html/rd1ab179db7e899635351df7d5de2286915cc439fd1f48e0057a373db%40%3Cdev.arrow.apache.org%3E


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9944: ARROW-12290: [Rust][DataFusion] Add input_file_name function

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9944:
URL: https://github.com/apache/arrow/pull/9944#issuecomment-816452198


   Thanks, both of you. I will have a look at the visitor pattern that @jorgecarleitao suggested, as I agree this is quite dirty. Let me see what is possible and have another go.





[GitHub] [arrow] alamb commented on pull request #9944: ARROW-12290: [Rust][DataFusion] Add input_file_name function

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9944:
URL: https://github.com/apache/arrow/pull/9944#issuecomment-816264432


   (BTW there is a small lint error on the PR)





[GitHub] [arrow] alamb commented on a change in pull request #9944: ARROW-12290: [Rust][DataFusion] Add input_file_name function

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9944:
URL: https://github.com/apache/arrow/pull/9944#discussion_r610150874



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -144,7 +144,7 @@ fn md5_array<T: StringOffsetSizeTrait>(
 }
 
 /// crypto function that accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]
-pub fn md5(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+pub fn md5(args: &[ColumnarValue], _: &Schema) -> Result<ColumnarValue> {

Review comment:
       Yeah, it is unfortunate that we had to plumb the `schema` argument all the way through.
   
   Another pattern I have seen (though it has its own downsides) is to use some sort of `thread_local` storage to pass information like this from the `RecordBatch`'s schema into the expression evaluation code.
   
   That is, before evaluating a `physical_expr`, we would set a `thread_local` holding a pointer back to the `Schema`, which `input_file_name` could then consult. This would avoid plumbing a mostly unused argument through everywhere.
   
   What do you think of this approach, @jorgecarleitao and @Dandandan?
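
   The thread-local idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DataFusion code: `Schema` here is a hypothetical stand-in struct rather than the Arrow type, and `CURRENT_SCHEMA`, `with_schema`, and the `"filename"` metadata key are names invented for the example.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a schema; only its key/value metadata matters here.
#[derive(Debug, Default)]
pub struct Schema {
    pub metadata: HashMap<String, String>,
}

thread_local! {
    // Schema of the batch currently being evaluated on this thread, if any.
    static CURRENT_SCHEMA: RefCell<Option<Arc<Schema>>> = RefCell::new(None);
}

/// Installs `schema` for the duration of `f`, then restores the previous
/// value, even if `f` panics.
pub fn with_schema<T>(schema: Arc<Schema>, f: impl FnOnce() -> T) -> T {
    // Guard that puts the previous value back when it goes out of scope.
    struct Restore(Option<Arc<Schema>>);
    impl Drop for Restore {
        fn drop(&mut self) {
            let previous = self.0.take();
            CURRENT_SCHEMA.with(|slot| *slot.borrow_mut() = previous);
        }
    }

    let previous = CURRENT_SCHEMA.with(|slot| slot.borrow_mut().replace(schema));
    let _restore = Restore(previous);
    f()
}

/// Keeps a plain argument list (no schema parameter) and reads the file name
/// from the thread-local instead. The "filename" key is an assumption.
pub fn input_file_name() -> Option<String> {
    CURRENT_SCHEMA.with(|slot| {
        slot.borrow()
            .as_ref()
            .and_then(|schema| schema.metadata.get("filename").cloned())
    })
}
```

   With something like this, the evaluation loop would wrap each expression evaluation in `with_schema(batch_schema, || ...)`, and functions such as `md5` would keep their original signatures. One downside is that a value set on one thread is not visible on another, so any work that moves between threads would have to install the schema again.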







[GitHub] [arrow] github-actions[bot] commented on pull request #9944: ARROW-12290: [Rust][DataFusion] Add input_file_name function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9944:
URL: https://github.com/apache/arrow/pull/9944#issuecomment-815402880


   https://issues.apache.org/jira/browse/ARROW-12290





[GitHub] [arrow] seddonm1 closed pull request #9944: ARROW-12290: [Rust][DataFusion] Add input_file_name function

Posted by GitBox <gi...@apache.org>.
seddonm1 closed pull request #9944:
URL: https://github.com/apache/arrow/pull/9944


   

