You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 12:06:03 UTC

[GitHub] [arrow] rdettai opened a new pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

rdettai opened a new pull request #8527:
URL: https://github.com/apache/arrow/pull/8527


   PR for https://issues.apache.org/jira/browse/ARROW-10389
   > As you can see in ARROW-10368, I had quite a hard time figuring out that the TableProvider trait could be used to provide custom source implementations for DataFusion.
   > I think that adding a read_table(provider: Box<dyn TableProvider + Send + Sync>) -> Result<Arc<dyn DataFrame>> method to the ExecutionContext and an example in the tests and/or in the examples folder would make the API more explicit.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8527:
URL: https://github.com/apache/arrow/pull/8527#issuecomment-716507864


   @jorgecarleitao this is the follow up to https://github.com/apache/arrow/pull/8513. As discussed, there is a simple way already available to implement custom sources through the `TableProvider` trait. I added the example/test in the tests folder because the `execution/context.rs` is already huge.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8527:
URL: https://github.com/apache/arrow/pull/8527#issuecomment-716508233


   https://issues.apache.org/jira/browse/ARROW-10389


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8527:
URL: https://github.com/apache/arrow/pull/8527


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8527:
URL: https://github.com/apache/arrow/pull/8527#issuecomment-718790491


   That makes sense. I do agree that it should be possible to buid the plan purely using the DataFrame API since it makes it much easier for other projects to re-use this and provide their own execution logic. Thanks for explaining and I will continue reviewing this today hopefully.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8527:
URL: https://github.com/apache/arrow/pull/8527#issuecomment-716710390


   Thanks a lot @rdettai for taking the time to understand the problem and come up with a way to extend this. I am not very familiar with the TableProvider API and its design, so that I am ccing @andygrove to help out with this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8527:
URL: https://github.com/apache/arrow/pull/8527#issuecomment-718455057


   @andygrove In my understanding (mainly from the way Spark works), when you use the dataframe API, you don't need to register anything on the global context. The dataframe acts as a sort of local context where you have all the sources, operations and actions for your processings. Wouldn't that be the intended behavior for DataFusion ? This PR is just about adding a way to use a custom provider as dataframe source directly, without having to register it to the context to then get it back as a dataframe. 
   
   This also has the benefit (in my opinion) to make the `TableProvider` interface more explicit when looking at the API (from my personal user experience, it took me a while to notice that you could implement `TableProvider` to add new sources 😄 ). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8527: ARROW-10389: [Rust] [DataFusion] Make the custom source implementation API more explicit

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8527:
URL: https://github.com/apache/arrow/pull/8527#issuecomment-718072011


   Thanks @rdettai I have started reviewing this now. TableProvider existed before we had the physical plan and at physical planning time gets translated into ParquetExec, CsvExec, or MemoryExec.
   
   The ExecutionContext has a register_table method that allows custom table providers to be registered and then referenced by name in the logical plan using the TableScan variant.
   
   It looks like this PR adds an alternate mechanism for acheiving the same goal? I'm not clear on the benefits of this approach over the existing approach but I haven't spent much time looking at this yet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org