You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/30 21:27:15 UTC

[GitHub] [arrow-datafusion] cpcloud opened a new pull request #1712: generalize table provider df impl

cpcloud opened a new pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712


   This is a draft PR following up a
   [discussion](https://github.com/apache/arrow-datafusion/pull/1699#discussion_r794994473)
   on #1699.
   
   Issues:
   
   1. The execution context state is using `Default`, it compiles and tests pass,
      but I don't know if that's the right thing to do there.
   1. It doesn't appear that it's possible to allow registration of `dyn
      DataFrame` without the `trait_upcasting` feature, which is unstable.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] realno commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
realno commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1026343875


   > do you think it makes sense to just replace DataFrameImpl with Dataframe now and get rid of all the trait object overheads?
   
   I like the idea of simplifying interfaces - given the issues with this PR I think it seems to be a good idea to just consolidate `DataFrame`. Please take this as an opinion given my limited knowledge of the code base. Another thought is, will this have any impact on things like Python library? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1025266914


   As for the first issue you mentioned, I think we need to avoid referencing both `DataFrameImpl` and `ExecutionContext` within the default TableProvider implementation because these two types are specific to the `DataFrameImpl` implementation of the trait.
   
   However, now that we don't have a separate Dataframe implementation in ballista anymore, I am not sure what we still gain from having the `DataFrame` trait defined. @andygrove @realno @yahoNanJing @alamb @Dandandan do you think it makes sense to just replace `DataFrameImpl` with `Dataframe` now and get rid of all the trait object overhead?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1025262177


   @cpcloud good point on lack of access to the trait_upcasting feature :( I found an interesting workaround for this online based on https://articles.bchlr.de/traits-dynamic-dispatch-upcasting:
   
   ```rust
   diff --git a/datafusion/src/datasource/datasource.rs b/datafusion/src/datasource/datasource.rs
   index 1b59c857f..703d14493 100644
   --- a/datafusion/src/datasource/datasource.rs
   +++ b/datafusion/src/datasource/datasource.rs
   @@ -55,9 +55,15 @@ pub enum TableType {
        Temporary,
    }
    
   +pub trait AsDynTableProvider {
   +    fn as_dyn_table<'a>(self: Arc<Self>) -> Arc<dyn TableProvider + 'a>
   +    where
   +        Self: 'a;
   +}
   +
    /// Source table
    #[async_trait]
   -pub trait TableProvider: Sync + Send {
   +pub trait TableProvider: Sync + Send + AsDynTableProvider {
        /// Returns the table provider as [`Any`](std::any::Any) so that it can be
        /// downcast to a specific implementation.
        fn as_any(&self) -> &dyn Any;
   @@ -94,3 +100,12 @@ pub trait TableProvider: Sync + Send {
            Ok(TableProviderFilterPushDown::Unsupported)
        }
    }
   +
   +impl<T: TableProvider + Sized> AsDynTableProvider for T {
   +    fn as_dyn_table<'a>(self: Arc<Self>) -> Arc<dyn TableProvider + 'a>
   +    where
   +        Self: 'a,
   +    {
   +        self
   +    }
   +}
   diff --git a/datafusion/src/execution/dataframe_impl.rs b/datafusion/src/execution/dataframe_impl.rs
   index c1933adaa..eb59e3a93 100644
   --- a/datafusion/src/execution/dataframe_impl.rs
   +++ b/datafusion/src/execution/dataframe_impl.rs
   @@ -550,7 +550,7 @@ mod tests {
            let mut ctx = ExecutionContext::new();
    
            // register a dataframe as a table
   -        ctx.register_table("test_table", df)?;
   +        ctx.register_table("test_table", df.as_dyn_table())?;
            Ok(())
        }
   ```
   
   The schema method name conflict is unfortunate and I don't have a good solution for that other than renaming one of the methods :( Perhaps other contributors can jump in to help suggest workarounds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1070324228


   thank you for your research on this @cpcloud , I believe https://github.com/apache/arrow-datafusion/pull/1998 has this covered now :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1026365987


   @alamb it's defined in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/dataframe.rs.
   
   @realno should have no impact to the python binding since traits are implementation details in the Rust land that won't get exposed to the python runtime.
   
   Let's leave this on for couple days to see if @andygrove @Jimexist @yahoNanJing has a strong opinion on this or not.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1025266914


   As for the first issue you mentioned, I think we need to avoid referencing both `DataFrameImpl` and `ExecutionContext` within the default TableProvider implementation because these two types are specific to the `DataFrameImpl` implementation of the trait.
   
   However, now that we don't have a separate Dataframe implementation in ballista anymore, I am not sure what we gain from having the `DataFrame` trait defined. @andygrove @realno @yahoNanJing @alamb @Dandandan do you think it makes sense to just replace `DataFrameImpl` with `Dataframe` now and get rid of all the trait object overhead?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1026710253


   > @alamb it's defined in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/dataframe.rs.
   
   Thanks @houqp  -- looking at the history of that file the `DataFrame` trait  appears to have been around a long time, lending credence to the idea that it might be due for some refactoring


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1026237402


   > do you think it makes sense to just replace DataFrameImpl with Dataframe now and get rid of all the trait object overheads?
   
   It seems to make sense to me -- it seems to me if someone wanted a different dataframe implementation they would probably just implement it themselves rather than trying to conform to an interface defined by `DataFusion. 
   
   I can't seem to find where the `DataFrame` trait was defined now (the history gets murky for me when the code got moved around)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1062583069


   Great, I filed https://github.com/apache/arrow-datafusion/issues/1962 to track the work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1026237402


   > do you think it makes sense to just replace DataFrameImpl with Dataframe now and get rid of all the trait object overheads?
   
   It seems to make sense to me -- it seems to me if someone wanted a different dataframe implementation they would probably just implement it themselves rather than trying to conform to an interface defined by DataFusion.  However, I may not be a good arbiter of what is useful and what is not as I don't use the DataFrame impl much
   
   I can't seem to find where the `DataFrame` trait was defined now (the history gets murky for me when the code got moved around)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1059430176


   While working on #1922 i also ran into some issues with the `DataFrame` trait that i wasnt expecting.  Since I hadnt really worked on that part of the code base before i was also surprised it was a trait and not a struct.  I would be supportive of moving `DataFrame` to a struct as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andygrove commented on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1064187204


   I think my original intent was to provide different implementations for DataFusion vs Ballista but we have a better solution now with the DataFusion context allowing a pluggable query planner.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712#issuecomment-1025266914


   As for the first issue you mentioned, I think we need to avoid referencing both `DataFrameImpl` and `ExecutionContext` within the default TableProvider implementation because these two types are specific to the `DataFrameImpl` implementation of the trait.
   
   However, now that we don't have a separate Dataframe implementation in ballista anymore, I am not sure what we gain from having the `DataFrame` trait defined. @andygrove @realno @yahoNanJing @alamb @Dandandan do you think it makes sense to just replace `DataFrameImpl` with `Dataframe` now and get rid of all the trait object overheads?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp closed pull request #1712: generalize table provider df impl

Posted by GitBox <gi...@apache.org>.
houqp closed pull request #1712:
URL: https://github.com/apache/arrow-datafusion/pull/1712


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org