Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/10/14 17:58:00 UTC

[jira] [Commented] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}

    [ https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617896#comment-17617896 ] 

Weston Pace commented on ARROW-18063:
-------------------------------------

{quote}
Refactor NamedTableProvider from a lambda mapping names -> data source into a registry so that data source factories can be added from c++ then referenced by name from python
{quote}

I'm not sure this is exactly what has been proposed.  Instead I think the idea is that the default named table provider is either a property of the ExecFactoryRegistry or part of some larger "AceroContext".  A user can then configure which named table provider to use by grabbing the default context and setting the named table provider on it at the same time they add exec factories.

There are then no python references or bindings needed at all.
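
Something like the following, as a very rough sketch (AceroContext and all of the names below are made up just to show the shape of the idea; the provider type is a simplified stand-in for the existing NamedTableProvider, which maps table names to a data source):

{code:cpp}
// Rough sketch only.  "AceroContext" does not exist today; the names below
// are hypothetical and only illustrate the proposed shape.
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Stand-in for whatever a named table resolves to.  In Arrow this would be
// something like arrow::Result<arrow::compute::Declaration>.
struct SourceDeclaration {};

// Simplified version of the existing NamedTableProvider: table names in the
// Substrait plan -> a (possibly streaming) source.
using NamedTableProvider =
    std::function<SourceDeclaration(const std::vector<std::string>& names)>;

class AceroContext {
 public:
  // Process-wide default, analogous to default_exec_factory_registry().
  static AceroContext* GetDefault() {
    static AceroContext instance;
    return &instance;
  }

  // Simplest semantics: the last caller wins (see the variant further down).
  void set_default_named_table_provider(NamedTableProvider provider) {
    std::lock_guard<std::mutex> lock(mutex_);
    provider_ = std::move(provider);
  }

  NamedTableProvider default_named_table_provider() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return provider_;
  }

 private:
  mutable std::mutex mutex_;
  NamedTableProvider provider_;
};

// The application embedding python would configure this once from C++, e.g.
//
//   AceroContext::GetDefault()->set_default_named_table_provider(
//       [](const std::vector<std::string>& names) {
//         return MakeStreamingSourceForUrl(names);  // user-defined
//       });
//
// and run_query would simply consult the default context, so nothing needs
// to cross the python boundary.
{code}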

I think this is a reasonable solution (I prefer AceroContext over MetaRegistry which was mentioned in the ML threads).

One does then have to consider what happens if two processes or calls attempt to configure the default named table provider.  I think the simplest option would be to just overwrite it.  It might be slightly nicer to throw an error when setting the default named table provider if it has already been set.  There are more complex alternatives, such as a named table provider registry or a chain of named table providers, but I'm not sure they are needed in this case.
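
The "error if it has already been set" option would only change the setter in the sketch above to something like:

{code:cpp}
  // Alternative to the last-caller-wins setter above: refuse to replace a
  // provider that has already been configured.
  // (Requires #include "arrow/status.h".)
  arrow::Status set_default_named_table_provider(NamedTableProvider provider) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (provider_) {
      return arrow::Status::Invalid(
          "The default named table provider has already been set");
    }
    provider_ = std::move(provider);
    return arrow::Status::OK();
  }
{code}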

CC [~icexelloss] to confirm.

> [C++][Python] Custom streaming data providers in {{run_query}}
> --------------------------------------------------------------
>
>                 Key: ARROW-18063
>                 URL: https://issues.apache.org/jira/browse/ARROW-18063
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> [Mailing list thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only available in c++
> - The python {{run_query}} function requires tables as input and cannot accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom cython is not preferred
> Some potential solutions:
> - Use ExecuteSerializedPlan() directly from c++ so that construction of data sources need not be handled in python. Passing a buffer from python/ibis down to C++ is much simpler and can be navigated without writing cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} into a registry so that data source factories can be added from c++ then referenced by name from python
> - Extend {{run_query}} to support non-Table sources and require the user to write a python mapping from URLs to {{pa.RecordBatchReader}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)