Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/25 20:09:27 UTC

[GitHub] [arrow-datafusion] thinkharderdev opened a new issue #1675: Improvements to Ballista extensibility

thinkharderdev opened a new issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Currently, we are working with DataFusion/Ballista as a query execution engine. One of the primary selling points of DataFusion is extensibility, but it is not currently possible to use DataFusion's many extension points with Ballista.
   
   This is primarily due to the constraints of serializing all logical and physical plans as Protobuf messages. 
   
   Ideally we would like to use Ballista to execute:
   * Scans using custom object stores
   * User-defined logical plan extensions
   * User-defined physical plan extensions
   * User-defined scalar and aggregation functions
   
   
   **Describe the solution you'd like**
   
   Ideally, there are two things:
   1. We would like to decouple the core Ballista functionality from the serializable representations of plans so that the serde layer can become pluggable/extensible.
   2. Serde should be aware of a user-defined `ExecutionContext` so we can leverage optimizers, extension planners, and UDFs/UDAFs.
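
A minimal sketch of what the two points above could look like, using only the Rust standard library. Every name here (`Plan`, `Context`, `PlanCodec`, `TextCodec`, `ship_plan`) is a hypothetical stand-in invented for illustration, not a DataFusion or Ballista type, and the text wire format is an assumption chosen to keep the example short. The core function depends only on the codec trait, and decoding consults the user-supplied context to decide whether an extension node is known.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a plan node (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// A built-in node that any codec is expected to understand.
    Scan { table: String },
    /// A user-defined node, identified by name.
    Extension { name: String, arg: String },
}

/// Hypothetical stand-in for a user-defined execution context.
#[derive(Default)]
pub struct Context {
    /// Names of the extensions the user has registered.
    pub extensions: HashSet<String>,
}

/// The pluggable serde layer: the core engine only sees this trait.
pub trait PlanCodec {
    fn encode(&self, plan: &Plan) -> String;
    /// Decoding receives the context so it can resolve user extensions.
    fn decode(&self, wire: &str, ctx: &Context) -> Result<Plan, String>;
}

/// One possible codec, using a toy `|`-separated text format.
pub struct TextCodec;

impl PlanCodec for TextCodec {
    fn encode(&self, plan: &Plan) -> String {
        match plan {
            Plan::Scan { table } => format!("scan|{}", table),
            Plan::Extension { name, arg } => format!("ext|{}|{}", name, arg),
        }
    }

    fn decode(&self, wire: &str, ctx: &Context) -> Result<Plan, String> {
        let parts: Vec<&str> = wire.splitn(3, '|').collect();
        match parts.as_slice() {
            ["scan", table] => Ok(Plan::Scan { table: table.to_string() }),
            // Extension nodes decode only if the context knows about them.
            ["ext", name, arg] if ctx.extensions.contains(*name) => Ok(Plan::Extension {
                name: name.to_string(),
                arg: arg.to_string(),
            }),
            ["ext", name, _] => Err(format!("extension '{}' is not registered", name)),
            _ => Err(format!("malformed plan: {}", wire)),
        }
    }
}

/// "Core" logic: sends a plan through whichever codec it is given.
pub fn ship_plan(codec: &dyn PlanCodec, plan: &Plan, ctx: &Context) -> Result<Plan, String> {
    let wire = codec.encode(plan);
    codec.decode(&wire, ctx)
}
```

With this shape, swapping the representation (for example, Protobuf for something else) would mean providing another `PlanCodec` implementation, while `ship_plan` stays unchanged.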
   
   **Describe alternatives you've considered**
   
   There is currently no workaround for this, but we have been prototyping possible solutions that we would be interested in upstreaming.
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andygrove commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1021646194


   It may be useful to see how substrait is handling extensions as well - https://substrait.io/extensions/





[GitHub] [arrow-datafusion] alamb commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1022483235


   I had a comment on https://github.com/apache/arrow-datafusion/pull/1677#pullrequestreview-863950055 which I think is worth considering





[GitHub] [arrow-datafusion] alamb commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1021594908


   cc @realno  @gaojun2048 @yahoNanJing
   
   





[GitHub] [arrow-datafusion] thinkharderdev commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1022172161


   Agree on the substrait integration. It would definitely be nice to have a universal serializable representation and a way to configure extensions declaratively.
   
   I posted a draft PR #1677 which I think can solve the immediate-term issues with extensibility and will also be useful in migrating to a substrait-based implementation. By decoupling the representation from the core execution engine we can avoid a "Big Bang" migration (not to mention an endless parade of painful rebases while in development :))





[GitHub] [arrow-datafusion] thinkharderdev commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1022606101


   Related questions after tinkering a bit more today:
   
   Should `SchemaProvider` methods be async? It could be useful to support integration with external metadata catalogs (AWS Glue, etc). This could also simplify the serialization of `LogicalPlan::TableScan`s by just passing a `TableReference` and resolving the `TableProvider` using the context.  
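
The serialization idea above can be sketched as follows, assuming a simplified, synchronous catalog. `Table`, `MemTable`, `Catalog`, `encode_scan`, and `decode_scan` are hypothetical names made up for this illustration and are not the real `SchemaProvider` or `TableProvider` APIs. Only the table reference crosses the wire; the receiving side resolves it against its own context. An async lookup (for an external catalog) would need an async runtime and is left out here.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, simplified stand-in for a table provider.
pub trait Table {
    fn columns(&self) -> Vec<String>;
}

/// A trivial in-memory table used only for this illustration.
pub struct MemTable {
    pub cols: Vec<String>,
}

impl Table for MemTable {
    fn columns(&self) -> Vec<String> {
        self.cols.clone()
    }
}

/// Hypothetical stand-in for the catalog held by an execution context.
#[derive(Default)]
pub struct Catalog {
    tables: HashMap<String, Arc<dyn Table>>,
}

impl Catalog {
    pub fn register(&mut self, name: &str, table: Arc<dyn Table>) {
        self.tables.insert(name.to_string(), table);
    }

    /// Synchronous lookup by name; returns None for unknown tables.
    pub fn resolve(&self, name: &str) -> Option<Arc<dyn Table>> {
        self.tables.get(name).cloned()
    }
}

/// The scan is serialized as nothing more than the table reference.
pub fn encode_scan(table_ref: &str) -> Vec<u8> {
    table_ref.as_bytes().to_vec()
}

/// The receiver turns the reference back into a table via its catalog.
pub fn decode_scan(bytes: &[u8], catalog: &Catalog) -> Result<Arc<dyn Table>, String> {
    let name = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    catalog
        .resolve(name)
        .ok_or_else(|| format!("unknown table '{}'", name))
}
```

The trade-off in this sketch is that every node decoding the plan must have the same tables registered, since the provider itself is never serialized.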





[GitHub] [arrow-datafusion] houqp commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1021874217


   FWIW, Andy wrote a substrait rust implementation: https://github.com/andygrove/substrait-rs.





[GitHub] [arrow-datafusion] realno edited a comment on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
realno edited a comment on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1021830009


   > Maybe it's better to introduce the substrait integration into the roadmap.
   
   +1
   
   





[GitHub] [arrow-datafusion] alamb commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1023657588


   > Should SchemaProvider methods be async? It could be useful to support integration with external metadata catalogs (AWS Glue, etc). This could also simplify the serialization of LogicalPlan::TableScans by just passing a TableReference and resolving the TableProvider using the context.
   
   I can see a rationale for making schema provider methods `async` if the use case is a one-time query across a pile of parquet files.
   
   However, in general, if one has to do network IO to figure out what tables exist or what their schemas are, it may be hard to get adequate performance.





[GitHub] [arrow-datafusion] yahoNanJing edited a comment on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
yahoNanJing edited a comment on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1021799128


   Thanks @thinkharderdev for proposing these potential improvements.
   
   > Scans using custom object stores
   
   For this, our team has actually implemented support for HDFS. To avoid registering a new object store, our workaround is to make the path self-describing through its scheme, like hdfs://localhost:15050/..../file.parquet. Then, from the scheme, we know which kind of remote object store we need.
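
A minimal sketch of this scheme-based selection, assuming a fixed set of store kinds. `StoreKind`, `split_scheme`, and `store_for` are hypothetical names for illustration only and are not taken from the HDFS implementation mentioned above; treating a path without a scheme as a local file is likewise an assumption.

```rust
/// Hypothetical set of object store kinds known to the engine.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StoreKind {
    Local,
    Hdfs,
}

/// Splits "scheme://rest" into (scheme, rest).
/// Paths without a scheme are treated as local files (an assumption).
pub fn split_scheme(path: &str) -> (&str, &str) {
    match path.split_once("://") {
        Some((scheme, rest)) => (scheme, rest),
        None => ("file", path),
    }
}

/// Chooses the object store from the path's scheme alone,
/// so no per-query store registration is needed.
pub fn store_for(path: &str) -> Result<StoreKind, String> {
    match split_scheme(path).0 {
        "file" => Ok(StoreKind::Local),
        "hdfs" => Ok(StoreKind::Hdfs),
        other => Err(format!("no object store for scheme '{}'", other)),
    }
}
```

For example, `store_for("hdfs://localhost:15050/data/file.parquet")` (a made-up path) selects `StoreKind::Hdfs`, while a scheme that is not listed produces an error rather than a silent fallback.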
   
   > User Defined logical plan extensions, physical plan extensions, scalar and aggregation functions
   
   Maybe it's better to introduce the substrait integration into the roadmap.





[GitHub] [arrow-datafusion] realno commented on issue #1675: Improvements to Ballista extensibility

Posted by GitBox <gi...@apache.org>.
realno commented on issue #1675:
URL: https://github.com/apache/arrow-datafusion/issues/1675#issuecomment-1021690436


   This would be a great improvement 👍 I will follow the design and PRs.

