Posted to jira@arrow.apache.org by "Li Jin (Jira)" <ji...@apache.org> on 2022/10/17 15:11:00 UTC

[jira] [Comment Edited] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}

    [ https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618950#comment-17618950 ] 

Li Jin edited comment on ARROW-18063 at 10/17/22 3:10 PM:
----------------------------------------------------------

>It might be slightly nicer to throw an error when setting the default named table provider if it has already been set. There are more complex alternatives such as a named table provider registry or a chain of named table providers but I'm not sure they are needed in this case.

I think either overriding or raising an error is fine. In practice, I don't expect our application to need to run the custom registration initialization more than once.
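
To make the two options concrete, here is a minimal sketch in plain Python. It does not use the Arrow API: the names {{set_default_named_table_provider}}, {{get_default_named_table_provider}} and the {{allow_override}} flag are hypothetical and only illustrate "raise if already set" versus "override".

```python
from typing import Callable, Iterator, List, Optional

# Hypothetical stand-in: a provider maps a list of table names to a stream of batches.
NamedTableProvider = Callable[[List[str]], Iterator[dict]]

# Single process-wide slot holding the default provider (assumption for this sketch).
_default_provider: Optional[NamedTableProvider] = None


def set_default_named_table_provider(
    provider: NamedTableProvider, *, allow_override: bool = False
) -> None:
    """Install the default provider; raise if one is already set unless overriding."""
    global _default_provider
    if _default_provider is not None and not allow_override:
        raise RuntimeError("default named table provider is already set")
    _default_provider = provider


def get_default_named_table_provider() -> NamedTableProvider:
    """Return the default provider, or raise if none has been installed."""
    if _default_provider is None:
        raise LookupError("no default named table provider has been set")
    return _default_provider
```

Since our initialization would only run once, either behavior would work for us; the strict variant mainly guards against accidental double registration.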

 

>Another alternative, which might be a more long-term solution, is to create a new Substrait extension which defines a new {{read_type}} (e.g. {{ExtensionTable}}) which contains the needed information (e.g. URL).

>We would then need to make it possible to construct custom sources from {{ExtensionTable}} though which probably puts us in roughly the same boat :). We would need an {{ExtensionTableProvider}} and we would probably want the default to be configurable.

I agree. Long term, we should also allow users to register a custom ExtensionTableProvider, ideally in a way similar to how ExecFactoryRegistry and NamedTableProvider are extended.
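
As a rough illustration of what "register by name, look up later" could look like, here is a plain-Python sketch. The class {{ExtensionTableProviderRegistry}} and its {{add_factory}}/{{get_factory}} methods are hypothetical names chosen for this example, not an existing Arrow API, and the factory signature (URL in, stream of batches out) is an assumption.

```python
from typing import Callable, Dict, Iterator

# Hypothetical: a factory turns the payload carried by an extension table
# (here just a URL string) into a stream of batches.
SourceFactory = Callable[[str], Iterator[dict]]


class ExtensionTableProviderRegistry:
    """Keyed registry of source factories (illustrative only)."""

    def __init__(self) -> None:
        self._factories: Dict[str, SourceFactory] = {}

    def add_factory(self, name: str, factory: SourceFactory) -> None:
        # Reject duplicates so two libraries cannot silently replace each other.
        if name in self._factories:
            raise KeyError(f"factory already registered: {name}")
        self._factories[name] = factory

    def get_factory(self, name: str) -> SourceFactory:
        try:
            return self._factories[name]
        except KeyError:
            raise KeyError(f"no factory registered under: {name}") from None
```

In this sketch the registration would happen once from the embedding application, and the plan would only need to carry the name plus the URL.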


> [C++][Python] Custom streaming data providers in {{run_query}}
> --------------------------------------------------------------
>
>                 Key: ARROW-18063
>                 URL: https://issues.apache.org/jira/browse/ARROW-18063
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> [Mailing list thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only available in C++
> - The Python {{run_query}} function requires tables as input and cannot accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom Cython is not preferred
> Some potential solutions:
> - Make ExecuteSerializedPlan() directly usable from C++ so that construction of data sources need not be handled in Python. Passing a buffer from Python/Ibis down to C++ is much simpler and can be navigated without writing Cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} into a registry so that data source factories can be added from C++ then referenced by name from Python
> - Extend {{run_query}} to support non-Table sources and require the user to write a Python mapping from URLs to {{pa.RecordBatchReader}}
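
The streaming requirement in the description can be sketched in plain Python as follows. This is only an illustration of consuming a source batch by batch instead of materializing a whole table: {{fake_url_stream}} and {{run_streaming_sum}} are hypothetical names, the generator stands in for something like a {{pa.RecordBatchReader}}, and no real URL is opened.

```python
from typing import Callable, Dict, Iterator, List

Batch = List[int]  # stand-in for a record batch in this sketch


def fake_url_stream(url: str, num_batches: int = 3) -> Iterator[Batch]:
    """Hypothetical reader: yields small batches lazily instead of loading all data."""
    for i in range(num_batches):
        yield [i * 2, i * 2 + 1]


def run_streaming_sum(
    table_urls: Dict[str, str],
    open_stream: Callable[[str], Iterator[Batch]],
) -> Dict[str, int]:
    """Resolve each table's URL to a stream and aggregate one batch at a time."""
    totals: Dict[str, int] = {}
    for table_name, url in table_urls.items():
        total = 0
        for batch in open_stream(url):  # only one batch is held in memory here
            total += sum(batch)
        totals[table_name] = total
    return totals


print(run_streaming_sum({"trades": "custom://trades"}, fake_url_stream))
```

The user-supplied mapping from URLs to readers in the third option would play the role of {{open_stream}} here.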


