Posted to dev@arrow.apache.org by Yaron Gvili <rt...@hotmail.com> on 2022/06/30 20:05:48 UTC

accessing Substrait protobuf Python classes from PyArrow

Hi,

Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?

I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-typed field, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is using google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.

[1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
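The failure mode and the proposed fix can be sketched as follows. The dictionary contents and the `substrait.plan_pb2` module path are hypothetical placeholders; the actual module path would depend on how the generated classes end up being packaged:

```python
import json

# A toy stand-in for a Substrait plan dictionary whose UDF field holds
# raw bytes (e.g. a pickled Python function).
plan_dict = {"relations": [{"root": {"udf_pickle": b"\x80\x04K*."}}]}

# The current string-based path breaks on the bytes value:
try:
    json.dumps(plan_dict)
except TypeError as exc:
    print(exc)  # Object of type bytes is not JSON serializable

# The proposed fix merges the dictionary directly into a protobuf message
# (sketch only; assumes the generated Substrait classes are importable):
#
#   from google.protobuf.json_format import ParseDict
#   from substrait.plan_pb2 import Plan  # hypothetical module path
#   plan = ParseDict(plan_dict, Plan())
```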


Yaron.

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
> However, from my understanding, the implementation effort is quite manageable. Please correct me if I'm wrong, ...

Yes, this was discussed earlier in this thread. The protobuf compiler would need to be run by the PyArrow build; currently, only the Arrow C++ build runs it. There would be a bit of work around reusing the build macro for external projects that pulls, verifies, and builds the external project's code. As for rewriting, a custom script is one way to do it and there is an existing example of it in `substrait-io/substrait-validator`, though using a Python refactoring tool such as Rope would be a cleaner way to do it; either way, there is also some work around avoiding the public exposure of Google protobuf classes.

> The other question, however, is whether we want to expose this as a public API, as a facility for users...

Only internal access to Substrait protobuf Python classes is needed, so they would not be publicly exposed, in the sense that they would be refactored to an internal namespace. The same applies to the Google protobuf classes to be used internally.


Yaron.
________________________________
From: Antoine Pitrou <an...@python.org>
Sent: Thursday, July 7, 2022 5:15 AM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow


Hi Yaron,

Le 07/07/2022 à 10:48, Yaron Gvili a écrit :
> It looks like the main decision to make is whether accessing Substrait protobuf Python classes from PyArrow is what we want here, because it would require some effort to implement as discussed.

I'm not the best person to comment on this, since I have little
acquaintance with Substrait and with the protobuf libraries/compiler for
Python.

However, from my understanding, the implementation effort is quite
manageable. Please correct me if I'm wrong, but it seems two steps would
be involved:
   1) run the protobuf compiler to produce Python modules and classes
representing the Substrait protobuf entities
   2) run a custom script to rewrite the generated import statements so
that they look into a custom subpackage of arrow (e.g.
`pyarrow.substrait.protobuf`) instead of the top-level package that the
protobuf compiler inflexibly generates
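Step 2 could be a small standalone script. A minimal sketch, with the `pyarrow.substrait.protobuf` target name and the regexes being illustrative only (a refactoring tool like Rope could do this more robustly):

```python
import re

# Hypothetical target subpackage; the real name would be decided in the PR.
NEW_ROOT = "pyarrow.substrait.protobuf"

def rewrite_imports(source: str) -> str:
    """Rewrite protoc-generated absolute imports of the top-level
    `substrait` package so they resolve inside the vendored subpackage."""
    # from substrait[.subpkg] import X  ->  from NEW_ROOT[.subpkg] import X
    source = re.sub(
        r"^from substrait(\.[\w.]+)? import ",
        lambda m: f"from {NEW_ROOT}{m.group(1) or ''} import ",
        source, flags=re.MULTILINE)
    # import substrait.X  ->  import NEW_ROOT.X
    source = re.sub(
        r"^import substrait\.([\w.]+)",
        rf"import {NEW_ROOT}.\1", source, flags=re.MULTILINE)
    return source

generated = "from substrait.extensions import extensions_pb2 as ext"
print(rewrite_imports(generated))
```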

If that is all that's required as far as implementation is concerned, I
personally don't see a problem with doing it *on a technical basis*. Some
amount of custom code generation and/or code rewriting is unavoidable in
a project with the size and complexity of Arrow (we already have custom
code generation in other areas, in any case).


The other question, however, is whether we want to expose this as a
public API, as a facility for users. What do we guarantee there? If for
example we switch to a newer protobuf compiler version, might it
generate slightly incompatible code? Is a particular version of protobuf
required at runtime (I assume the generated code depends on the protobuf
libs, as it does in C++)? Those are the real questions that need to be
addressed, IMHO.
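One way to make the runtime-version question concrete: pyarrow could compare the installed protobuf runtime against the compiler version used at build time. This is a hedged sketch of one plausible compatibility rule, not protobuf's official policy, which the real check would have to follow:

```python
def protobuf_runtime_compatible(runtime: str, compiler: str) -> bool:
    """Illustrative rule: accept the runtime when its major version matches
    the compiler's and it is at least as new in the remaining components."""
    r = tuple(int(x) for x in runtime.split("."))
    c = tuple(int(x) for x in compiler.split("."))
    return r[0] == c[0] and r[1:] >= c[1:]

print(protobuf_runtime_compatible("4.25.1", "4.21.0"))  # True
print(protobuf_runtime_compatible("3.20.3", "4.21.0"))  # False
```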

Regards

Antoine.



>
>> These are two reasons to need the protobuf in python.  However, I still don't fully understand why this is needed for pyarrow.
>
> For #1, the invocations of google.protobuf.json_format.ParseDict() are going to be in PyArrow unit tests of UDF functionality. As described earlier, I was able to avoid this invocation until I got a "TypeError: Object of type bytes is not JSON serializable" for a bytes-type field holding the UDF pickle.
>
> For #2, this is related to the more general preference for PyArrow when dealing with UDFs. Basically, it is easier for the Substrait-producer to pickle a Python-implemented UDF into a Substrait plan via PyArrow APIs and, similarly, for the Substrait-consumer to unpickle the UDF from the Substrait plan via PyArrow APIs.  Doing it via Arrow C++ APIs, using callbacks or by launching an embedded Python interpreter, is undesirable IMO. More specifically, the use case of #2 is for UDTs (user-defined tables), which are objects somewhat more complex than UDFs and are yet easier to deal with using Python code.
>
>> In the first approach it would make sense that you would need Substrait access from python.
>
> The first approach is basically the design I'm advocating, except that the registrations allow the replacing-step to be avoided, so the Substrait plan need not be modified. Indeed, this approach leads to a need for Substrait access from Python (and PyArrow, as explained above).
>
>> In the second approach it wouldn't be needed.
>
> Agreed. Though this approach has its disadvantages. For example, each new kind of user-defined object (like UDT) would require at least one new callback in Arrow C++ code. It would also be harder for the user to manually register UDFs outside the context of a Substrait plan, and I think also the Arrow Substrait APIs would need to be changed to accept a FunctionRegistry in order to support a nested registry; it's easier, and probably more flexible, to let PyArrow drive instead of getting called back.
>
>> Is this something I might be able to infer from [1] (I haven't yet had a chance to look at that in detail)?
>
> [1] is focused on UDFs, so it could give you an idea but won't show you the details of UDT integration.
>
>
> Yaron.
> ________________________________
> From: Weston Pace <we...@gmail.com>
> Sent: Wednesday, July 6, 2022 1:30 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
>> 1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
>> 2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
>
> These are two reasons to need the protobuf in python.  However, I
> still don't fully understand why this is needed for pyarrow.
>
> I can maybe try and infer a little.  For example, is this required to
> add support in pyarrow for running UDFs that are embedded in the
> Substrait plan?
>
> One possible implementation could be to:
>
>   * Have pyarrow accept the Substrait plan
>   * Extract the embedded UDFs
>   * Register Arrow UDFs
>   * Replace the calls to the UDF in the Substrait plan with calls to
> the newly registered UDF
>   * Submit the modified Substrait plan to Arrow-C++
>
> However, another potential implementation could be to:
>
>   * Define a "UDF provider" in C++.  This is a callback that receives a
> buffer (the pickled UDF) and a name and also a function that creates
> an Arrow "call" given the name of a UDF
>   * Implement the UDF provider in pyarrow
>      * The implementation would unpickle the UDF and register a UDF
> with Arrow.  Then making the Arrow call would just be creating a call
> into the registered UDF.
>   * Continue to have Substrait plans go directly into Arrow-C++, but
> now pyarrow is a "UDF provider" for "pickle UDFs"
>
> In the first approach it would make sense that you would need
> Substrait access from python.  In the second approach it wouldn't be
> needed.  I'm not advocating for one approach vs the other but I am
> trying to understand the overall goal with adding this support into
> pyarrow.  However, a PR may be clearer.  Is this something I might be
> able to infer from [1] (I haven't yet had a chance to look at that in
> detail)?
>
>> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?
>
> TMK, that should be fine for a build dependency.
>
> [1] https://github.com/apache/arrow/pull/13500
>
> On Wed, Jul 6, 2022 at 5:07 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>
>> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?
>>
>>
>> Yaron.
>> ________________________________
>> From: Yaron Gvili <rt...@hotmail.com>
>> Sent: Wednesday, July 6, 2022 7:26 AM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> @Weston: The need for accessing the Substrait protobuf Python classes is not for the purpose of modifying the Substrait plan. There are two relevant cases I went into:
>>
>>    1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
>>    2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
>>
>> Yaron.
>> ________________________________
>> From: Weston Pace <we...@gmail.com>
>> Sent: Tuesday, July 5, 2022 7:59 PM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> If this is to be used within pyarrow then I think it makes sense as
>> something to do.  Can you expand a bit more on what you are trying to
>> do?  Why do you need to modify the Substrait plan in pyarrow?
>>
>>> This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>>
>> I'm not quite certain why this requires a modification to the plan.
>>
>>
>> On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>>
>>> @Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>>>
>>>
>>> Yaron.
>>>
>>> ________________________________
>>> From: Li Jin <ic...@gmail.com>
>>> Sent: Tuesday, July 5, 2022, 4:22 PM
>>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>
>>> Yaron, do we need to parse the subtrait protobuf in Python so that we can
>>> get the UDFs and register them with Pyarrow?
>>>
>>> On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
>>>
>>>> This rewriting of the package is basically what I had in mind; the `_ep`
>>>> was just to signal a private package, which cannot be enforced, of course.
>>>> Assuming this rewriting would indeed avoid conflict with any standard
>>>> protobuf package a user might want to use, it would be nicer to do this
>>>> using a robust refactoring
>>>> tool. We could try Rope [1] (in particular, its folder refactoring library
>>>> method [2]) for this; it should add Rope only as a build dependency. Does
>>>> this sound worth trying?
>>>>
>>>> [1] https://github.com/python-rope/rope
>>>> [2]
>>>> https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
>>>>
>>>>
>>>> Yaron.
>>>> ________________________________
>>>> From: Jeroen van Straten <je...@gmail.com>
>>>> Sent: Monday, July 4, 2022 12:38 PM
>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>
>>>> Ah, I see. I guess that complicates things a little.
>>>>
>>>> About the exposure part in C++, the idea is that if we statically link
>>>> libprotobuf and use private linkage for the generated classes, *hopefully*
>>>> users won't run into linking issues when they link to both Arrow and to
>>>> some libprotobuf of their own, especially when they or another one of their
>>>> dependencies also includes Substrait protobuf. The classes would absolutely
>>>> conflict otherwise, and the libprotobuf version may or may not conflict.
>>>> The latter is kind of an unsolved problem; AFAIK there have been talks to
>>>> replace libprotobuf with a third-party alternative to avoid all this
>>>> nonsense, but nothing concrete yet.
>>>>
>>>> For Python, the problems are not quite the same:
>>>>
>>>>   - There is no such thing as private linkage in Python, so we *have* to
>>>> re-namespace the generated Python files. Note that the output of protoc
>>>> assumes that the files are in exactly the same namespace/module
>>>> path/whatever as the proto files specify, so the toplevel module would be
>>>> "substrait", not "arrow". See
>>>> https://github.com/substrait-io/substrait/pull/207 and
>>>>
>>>> https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
>>>> for my ugly workarounds for this thus far.
>>>>   - There is also no such thing as static linkage in Python, so if you'd
>>>> want to use Google's protobuf implementation in Python pyarrow *will* need
>>>> to depend on it, and it'd become the user's problem to make sure that the
>>>> installed Python protobuf version matches the version of their system
>>>> library. It looks like pyarrow currently only depends on numpy, which is
>>>> pretty awesome... so I feel like we should keep it that way.
>>>>
>>>> Not sure what the best course of action is.
>>>>
>>>> Jeroen
>>>>
>>>> On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
>>>>
>>>>> Thanks, the Google protobuf exposure concerns are clear. Another concern
>>>>> would be consistent maintenance of the same version of Substrait protobuf
>>>>> used in Arrow C++ and in PyArrow.
>>>>>
>>>>> I didn't mean access to users but internally. That is, I didn't mean to
>>>>> expose Substrait protobuf Python classes to PyArrow users, just to use
>>>> them
>>>>> internally in PyArrow code I'll be developing, per the use case I
>>>> described
>>>>> at the start of this thread. IIUC, this should address the exposure
>>>>> concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
>>>>> too. If the technical approach I just described would actually expose the
>>>>> classes, what would be a proper way to avoid exposing them? Perhaps the
>>>>> classes should be generated into a private package, e.g., under
>>>>> `python/_ep`? (ep stands for external project)
>>>>>
>>>>>
>>>>> Yaron.
>>>>> ________________________________
>>>>> From: Antoine Pitrou <an...@python.org>
>>>>> Sent: Sunday, July 3, 2022 3:20 PM
>>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>>
>>>>>
>>>>> I agree that giving direct access to protobuf classes is not Arrow's
>>>>> job. You can probably take the upstream (i.e. Substrait's) protobuf
>>>>> definitions and compile them yourself, using whatever settings required
>>>>> by your project.
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>
>>>>> Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
>>>>>> It's not so much about whether or not we *can*, but about whether or
>>>> not
>>>>> we
>>>>>> *should* generate and expose these files.
>>>>>>
>>>>>> Fundamentally, Substrait aims to be a means to connect different
>>>> systems
>>>>>> together by standardizing some interchange format. Currently that
>>>> happens
>>>>>> to be based on protobuf, so *one* (obvious) way to generate and
>>>> interpret
>>>>>> Substrait plans is to use Google's own protobuf implementation, as
>>>>>> generated by protoc for various languages. It's true that there's
>>>> nothing
>>>>>> fundamentally blocking Arrow from exposing those.
>>>>>>
>>>>>> However, it's by no means the *only* way to interact with protobuf
>>>>>> serializations, and Google's own implementation is a hot mess when it
>>>>> comes
>>>>>> to dependencies; there are lots of good reasons why you might not want
>>>> to
>>>>>> depend on it and opt for a third-party implementation instead. For one
>>>>>> thing, the Python wheel depends on system libraries beyond manylinux,
>>>> and
>>>>>> if they happen to be incompatible (which is likely) it just explodes
>>>>> unless
>>>>>> you set some environment variable. Therefore, assuming pyarrow doesn't
>>>>>> already depend on protobuf, I feel like we should keep it that way, and
>>>>>> thus that we should not include the generated Python files in the
>>>>> release.
>>>>>> Note that we don't even expose/leak the protoc-generated C++ classes
>>>> for
>>>>>> similar reasons.
>>>>>>
>>>>>> Also, as Weston already pointed out, it's not really our job; Substrait
>>>>>> aims to publish bindings for various languages by itself. It just
>>>> doesn't
>>>>>> expose them for Python *yet*. In the interim I suppose you could use
>>>> the
>>>>>> substrait-validator package from PyPI. It does expose them, as well as
>>>>> some
>>>>>> convenient conversion functions, but I'm having trouble finding people
>>>> to
>>>>>> help me keep the validator maintained.
>>>>>>
>>>>>> I suppose versioning would be difficult either way, since Substrait
>>>>>> semi-regularly pushes breaking changes and Arrow currently lags behind
>>>> by
>>>>>> several months (though I have a PR open for Substrait 0.6). I guess
>>>> from
>>>>>> that point of view distributing the right version along with pyarrow
>>>>> seems
>>>>>> nice, but the issues of Google's protobuf implementation remain. This
>>>>> being
>>>>>> an issue at all is also very much a Substrait problem, not an Arrow
>>>>>> problem; at best we can try to mitigate it.
>>>>>>
>>>>>> Jeroen
>>>>>>
>>>>>> On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
>>>>>>
>>>>>>> I looked into the Arrow build system some more. It is possible to get
>>>>> the
>>>>>>> Python classes generated by adding "--python-out" flag (set to a
>>>>> directory
>>>>>>> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
>>>>>>> `macro(build_substrait)` in
>>>>> `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
>>>>>>> However, this makes them available only in the Arrow C++ build whereas
>>>>> for
>>>>>>> the current purpose they need to be available in the PyArrow build.
>>>> The
>>>>>>> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
>>>> has
>>>>>>> access to `cpp/cmake_modules`. So, one solution could be to pull
>>>>>>> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
>>>>>>> generate the Python protobuf classes under `python/`, making them
>>>>> available
>>>>>>> for import by PyArrow code. This would probably be cleaner with some
>>>>> macro
>>>>>>> parameters to distinguish between C++ and Python generation.
>>>>>>>
>>>>>>> Does this sound like a reasonable approach?
>>>>>>>
>>>>>>>
>>>>>>> Yaron.
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Yaron Gvili <rt...@hotmail.com>
>>>>>>> Sent: Saturday, July 2, 2022 8:55 AM
>>>>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>>>>>>> cpcloud@gmail.com>
>>>>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>>>>
>>>>>>> I'm somewhat confused by this answer because I think resolving the
>>>>> issue I
>>>>>>> raised does not require any change outside PyArrow. I'll try to
>>>> explain
>>>>> the
>>>>>>> issue differently.
>>>>>>>
>>>>>>> First, let me describe the current situation with Substrait protobuf
>>>> in
>>>>>>> Arrow C++. The Arrow C++ build system handles external projects in
>>>>>>> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
>>>> external
>>>>>>> projects is "substrait". By default, the build system takes the source
>>>>> code
>>>>>>> for "substrait" from `
>>>>>>>
>>>>>
>>>> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
>>>>>>> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
>>>>>>> `cpp/thirdparty/versions.txt`. The source code is check-summed and
>>>>> unpacked
>>>>>>> in `substrait_ep-prefix` under the build directory and from this the
>>>>>>> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
>>>>>>> `substrait_ep-generated` under the build directory. The build system
>>>>> makes
>>>>>>> a library using the `*.cc` files and makes the `*.h` files available
>>>> for
>>>>>>> other C++ modules to use.
>>>>>>>
>>>>>>> Setting up the above mechanism did not require any change in the
>>>>>>> `substrait-io/substrait` repo, nor any coordination with its authors.
>>>>> What
>>>>>>> I'm looking for is a similar build mechanism for PyArrow that builds
>>>>>>> Substrait protobuf Python classes and makes them available for use by
>>>>> other
>>>>>>> PyArrow modules. I believe this PyArrow build mechanism does not exist
>>>>>>> currently and that setting up one would not require any changes
>>>> outside
>>>>>>> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
>>>>> others
>>>>>>> agree this mechanism is needed at least due to the problem I ran into
>>>>> that
>>>>>>> I previously described, and (3) for any thoughts about how to set up
>>>>> this
>>>>>>> mechanism assuming it is needed.
>>>>>>>
>>>>>>> Weston, perhaps your thinking was that the Substrait protobuf Python
>>>>>>> classes need to be built by a repo in the substrait-io space and made
>>>>>>> available as a binary+headers package? This can be done but will
>>>> require
>>>>>>> involving Substrait people and appears to be inconsistent with current
>>>>>>> patterns in the Arrow build system. Note that for my purposes here,
>>>> the
>>>>>>> Substrait protobuf Python classes will be used for composing or
>>>>>>> interpreting a Substrait plan, not for transforming it by an
>>>> optimizer,
>>>>>>> though a Python-based optimizer is a valid use case for them.
>>>>>>>
>>>>>>>
>>>>>>> Yaron.
>>>>>>> ________________________________
>>>>>>> From: Weston Pace <we...@gmail.com>
>>>>>>> Sent: Friday, July 1, 2022 12:42 PM
>>>>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>>>>>>> cpcloud@gmail.com>
>>>>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>>>>
>>>>>>> Given that Acero does not do any planner / optimizer type tasks I'm
>>>>>>> not sure you will find anything like this in arrow-cpp or pyarrow.
>>>>>>> What you are describing I sometimes refer to as "plan slicing and
>>>>>>> dicing".  I have wondered if we will someday need this in Acero but I
>>>>>>> fear it is a slippery slope between "a little bit of plan
>>>>>>> manipulation" and "a full blown planner" so I've shied away from it.
>>>>>>> My first spot to look would be a substrait-python repository which
>>>>>>> would belong here: https://github.com/substrait-io
>>>>>>>
>>>>>>> However, it does not appear that such a repository exists.  If you're
>>>>>>> willing to create one then a quick ask on the Substrait Slack instance
>>>>>>> should be enough to get the repository created.  Perhaps there is some
>>>>>>> genesis of this library in Ibis although I think Ibis would use its
>>>>>>> own representation for slicing and dicing and only use Substrait for
>>>>>>> serialization.
>>>>>>>
>>>>>>> Once that repository is created pyarrow could probably import it but
>>>>>>> unless this plan manipulation makes sense purely from a pyarrow
>>>>>>> perspective I would rather prefer that the user application import
>>>>>>> both pyarrow and substrait-python independently.
>>>>>>>
>>>>>>> Perhaps @Phillip Cloud or someone from the Ibis space might have some
>>>>>>> ideas on where this might be found.
>>>>>>>
>>>>>>> -Weston
>>>>>>>
>>>>>>> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there support for accessing Substrait protobuf Python classes
>>>> (such
>>>>>>> as Plan) from PyArrow? If not, how should such support be added? For
>>>>>>> example, should the PyArrow build system pull in the Substrait repo as
>>>>> an
>>>>>>> external project and build its protobuf Python classes, in a manner
>>>>> similar
>>>>>>> to how Arrow C++ does it?
>>>>>>>>
>>>>>>>> I'm pondering these questions after running into an issue with code
>>>> I'm
>>>>>>> writing under PyArrow that parses a Substrait plan represented as a
>>>>>>> dictionary. The current (and kind of shaky) parsing operation in this
>>>>> code
>>>>>>> uses json.dumps() on the dictionary, which results in a string that is
>>>>>>> passed to a Cython API that handles it using Arrow C++ code that has
>>>>> access
>>>>>>> to Substrait protobuf C++ classes. But when the Substrait plan
>>>> contains
>>>>> a
>>>>>>> bytes-type, json.dumps() no longer works and fails with "TypeError:
>>>>> Object
>>>>>>> of type bytes is not JSON serializable". A fix for this, and a better
>>>>> way
>>>>>>> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
>>>>>>> dictionary. However, this invocation requires a second argument,
>>>> namely
>>>>> a
>>>>>>> protobuf message instance to merge with. The class of this message
>>>>> (such as
>>>>>>> Plan) is a Substrait protobuf Python class, hence the need to access
>>>>> such
>>>>>>> classes from PyArrow.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>
>>>>>
>>>> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>>>>>>>>
>>>>>>>>
>>>>>>>> Yaron.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Antoine Pitrou <an...@python.org>.
Hi Yaron,

Le 07/07/2022 à 10:48, Yaron Gvili a écrit :
> It looks like the main decision to make is whether accessing Substrait protobuf Python classes from PyArrow is what we want here, because it would require some effort to implement as discussed.

I'm not the best person to comment on this, since I have little 
acquaintance with Substrait and with the protobuf libraries/compiler for 
Python.

However, from my understanding, the implementation effort is quite 
manageable. Please correct me if I'm wrong, but it seems two steps would 
be involved:
   1) run the protobuf compiler to produce Python modules and classes 
representing the Substrait protobuf entities
   2) run a custom script to rewrite the generated import statements so 
that they look into a custom subpackage of arrow (e.g. 
`pyarrow.substrait.protobuf`) instead of the top-level package that the 
protobuf compiler inflexibly generates

If it is all that's required as far as implementation is concerned, I 
personally don't see a problem to doing it *on a technical basis*. Some 
amount of custom code generation and/or code rewriting is unavoidable in 
a project with the size and complexity of Arrow (we already have custom 
code generation in other areas, in any case).


The other question, however, is whether we want to expose this as a 
public API, as a facility for users. What do we guarantee there? If for 
example we switch to a newer protobuf compiler version, might it 
generate slightly incompatible code? Is a particular version of protobuf 
required at runtime (I assume the generated code depends on the protobuf 
libs, as it does in C++)? Those are the real questions that need to be 
addressed, IMHO.

Regards

Antoine.



> 
>> These are two reasons to need the protobuf in python.  However, I still don't fully understand why this is needed for pyarrow.
> 
> For #1, the invocations of google.protobuf.json_format.ParseDict() are going to be in PyArrow unit tests of UDF functionality. As described earlier, I was able to avoid this invocation until I got a "TypeError: Object of type bytes is not JSON serializable" for a bytes-type field holding the UDF pickle.
> 
> For #2, this is related to the more general preference for PyArrow when dealing with UDFs. Basically, it is easier for the Substrait-producer to pickle a Python-implemented UDF into a Substrait plan via PyArrow APIs and, similarly, for the Substrait-consumer to unpickle the UDF from the Substrait plan via PyArrow APIs.  Doing it via Arrow C++ APIs, using callbacks or by launching an embedded Python interpreter, is undesirable IMO. More specifically, the use case of #2 is for UDTs (user-defined tables), which are objects somewhat more complex than UDFs and are yet easier to deal with using Python code.
> 
>> In the first approach it would make sense that you would need Substrait access from python.
> 
> The first approach is basically the design I'm advocating, except that the registrations allow the replacing-step to be avoided, so the Substrait plan need not be modified. Indeed, this approach leads to a need for Substrait access from Python (and PyArrow, as explained above).
> 
>> In the second approach it wouldn't be needed.
> 
> Agreed. Though this approach has its disadvantages. For example, each new kind of user-defined object (like UDT) would require at least one new callback in Arrow C++ code. It would also be harder for the user to manually register UDFs outside the context of a Substrait plan, and I think also the Arrow Substrait APIs would need to be changed to accept a FunctionRegistry in order to support a nested registry; it's easier, and probably more flexible, to let PyArrow drive instead of getting called back.
> 
>> Is this something I might be able to infer from [1] (I haven't yet had a chance to look at that in detail)?
> 
> [1] is focused on UDFs, so it could give you an idea but won't show you the details of UDT integration.
> 
> 
> Yaron.
> ________________________________
> From: Weston Pace <we...@gmail.com>
> Sent: Wednesday, July 6, 2022 1:30 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> 
>> 1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
> 2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
> 
> These are two reasons to need the protobuf in python.  However, I
> still don't fully understand why this is needed for pyarrow.
> 
> I can maybe try and infer a little.  For example, is this required to
> add support in pyarrow for running UDFs that are embedded in the
> Substrait plan?
> 
> One possible implementation could be to:
> 
>   * Have pyarrow accept the Substrait plan
>   * Extract the embedded UDFs
>   * Register Arrow UDFs
>   * Replace the calls to the UDF in the Substrait plan with calls to
> the newly registered UDF
>   * Submit the modified Substrait plan to Arrow-C++
> 
> However, another potential implementation could be to:
> 
>   * Define a "UDF provider" in C++.  This is a callback that receives a
> buffer (the pickled UDF) and a name and also a function that creates
> an Arrow "call" given the name of a UDF
>   * Implement the UDF provider in pyarrow
>      * The implementation would unpickle the UDF and register a UDF
> with Arrow.  Then making the Arrow call would just be creating a call
> into the registered UDF.
>   * Continue to have Substrait plans go directly into Arrow-C++, but
> now pyarrow is a "UDF provider" for "pickle UDFs"
> 
> In the first approach it would make sense that you would need
> Substrait access from python.  In the second approach it wouldn't be
> needed.  I'm not advocating for one approach vs the other but I am
> trying to understand the overall goal with adding this support into
> pyarrow.  However, a PR may be clearer.  Is this something I might be
> able to infer from [1] (I haven't yet had a chance to look at that in
> detail)?
> 
>> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?
> 
> TMK, that should be fine for a build dependency.
> 
> [1] https://github.com/apache/arrow/pull/13500
> 
> On Wed, Jul 6, 2022 at 5:07 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>
>> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?
>>
>>
>> Yaron.
>> ________________________________
>> From: Yaron Gvili <rt...@hotmail.com>
>> Sent: Wednesday, July 6, 2022 7:26 AM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> @Weston: The need for accessing the Substrait protobuf Python classes is not for the purpose of modifying the Substrait plan. There are two relevant cases I went into:
>>
>>    1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
>>    2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
>>
>> Yaron.
>> ________________________________
>> From: Weston Pace <we...@gmail.com>
>> Sent: Tuesday, July 5, 2022 7:59 PM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> If this is to be used within pyarrow then I think it makes sense as
>> something to do.  Can you expand a bit more on what you are trying to
>> do?  Why do you need to modify the Substrait plan in pyarrow?
>>
>>> This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>>
>> I'm not quite certain why this requires a modification to the plan.
>>
>>
>> On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>>
>>> @Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>>>
>>>
>>> Yaron.
>>>
>>> ________________________________
>>> From: Li Jin <ic...@gmail.com>
>>> Sent: Tuesday, July 5, 2022, 4:22 PM
>>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>
>>> Yaron, do we need to parse the Substrait protobuf in Python so that we can
>>> get the UDFs and register them with PyArrow?
>>>
>>> On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
>>>
>>>> This rewriting of the package is basically what I had in mind; the `_ep`
>>>> was just to signal a private package, which cannot be enforced, of course.
>>>> Assuming this rewriting would indeed avoid conflict with any standard
>>>> protobuf package a user might want to use, it would be nicer to do this
>>>> using a robust refactoring
>>>> tool. We could try Rope [1] (in particular, its folder refactoring library
>>>> method [2]) for this; it should add Rope only as a build dependency. Does
>>>> this sound worth trying?
>>>>
>>>> [1] https://github.com/python-rope/rope
>>>> [2]
>>>> https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
>>>>
>>>>
>>>> Yaron.
>>>> ________________________________
>>>> From: Jeroen van Straten <je...@gmail.com>
>>>> Sent: Monday, July 4, 2022 12:38 PM
>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>
>>>> Ah, I see. I guess that complicates things a little.
>>>>
>>>> About the exposure part in C++, the idea is that if we statically link
>>>> libprotobuf and use private linkage for the generated classes, *hopefully*
>>>> users won't run into linking issues when they link to both Arrow and to
>>>> some libprotobuf of their own, especially when they or another one of their
>>>> dependencies also includes Substrait protobuf. The classes would absolutely
>>>> conflict otherwise, and the libprotobuf version may or may not conflict.
>>>> The latter is kind of an unsolved problem; AFAIK there have been talks to
>>>> replace libprotobuf with a third-party alternative to avoid all this
>>>> nonsense, but nothing concrete yet.
>>>>
>>>> For Python, the problems are not quite the same:
>>>>
>>>>   - There is no such thing as private linkage in Python, so we *have* to
>>>> re-namespace the generated Python files. Note that the output of protoc
>>>> assumes that the files are in exactly the same namespace/module
>>>> path/whatever as the proto files specify, so the toplevel module would be
>>>> "substrait", not "arrow". See
>>>> https://github.com/substrait-io/substrait/pull/207 and
>>>>
>>>> https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
>>>> for my ugly workarounds for this thus far.
>>>>   - There is also no such thing as static linkage in Python, so if you'd
>>>> want to use Google's protobuf implementation in Python pyarrow *will* need
>>>> to depend on it, and it'd become the user's problem to make sure that the
>>>> installed Python protobuf version matches the version of their system
>>>> library. It looks like pyarrow currently only depends on numpy, which is
>>>> pretty awesome... so I feel like we should keep it that way.
>>>>
>>>> Not sure what the best course of action is.
>>>>
>>>> Jeroen
>>>>
>>>> On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
>>>>
>>>>> Thanks, the Google protobuf exposure concerns are clear. Another concern
>>>>> would be consistent maintenance of the same version of Substrait protobuf
>>>>> used in Arrow C++ and in PyArrow.
>>>>>
>>>>> I didn't mean access to users but internally. That is, I didn't mean to
>>>>> expose Substrait protobuf Python classes to PyArrow users, just to use
>>>> them
>>>>> internally in PyArrow code I'll be developing, per the use case I
>>>> described
>>>>> at the start of this thread. IIUC, this should address the exposure
>>>>> concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
>>>>> too. If the technical approach I just described would actually expose the
>>>>> classes, what would be a proper way to avoid exposing them? Perhaps the
>>>>> classes should be generated into a private package, e.g., under
>>>>> `python/_ep`? (ep stands for external project)
>>>>>
>>>>>
>>>>> Yaron.
>>>>> ________________________________
>>>>> From: Antoine Pitrou <an...@python.org>
>>>>> Sent: Sunday, July 3, 2022 3:20 PM
>>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>
>>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>>
>>>>>
>>>>> I agree that giving direct access to protobuf classes is not Arrow's
>>>>> job. You can probably take the upstream (i.e. Substrait's) protobuf
>>>>> definitions and compile them yourself, using whatever settings required
>>>>> by your project.
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>
>>>>> Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
>>>>>> It's not so much about whether or not we *can*, but about whether or
>>>> not
>>>>> we
>>>>>> *should* generate and expose these files.
>>>>>>
>>>>>> Fundamentally, Substrait aims to be a means to connect different
>>>> systems
>>>>>> together by standardizing some interchange format. Currently that
>>>> happens
>>>>>> to be based on protobuf, so *one* (obvious) way to generate and
>>>> interpret
>>>>>> Substrait plans is to use Google's own protobuf implementation, as
>>>>>> generated by protoc for various languages. It's true that there's
>>>> nothing
>>>>>> fundamentally blocking Arrow from exposing those.
>>>>>>
>>>>>> However, it's by no means the *only* way to interact with protobuf
>>>>>> serializations, and Google's own implementation is a hot mess when it
>>>>> comes
>>>>>> to dependencies; there are lots of good reasons why you might not want
>>>> to
>>>>>> depend on it and opt for a third-party implementation instead. For one
>>>>>> thing, the Python wheel depends on system libraries beyond manylinux,
>>>> and
>>>>>> if they happen to be incompatible (which is likely) it just explodes
>>>>> unless
>>>>>> you set some environment variable. Therefore, assuming pyarrow doesn't
>>>>>> already depend on protobuf, I feel like we should keep it that way, and
>>>>>> thus that we should not include the generated Python files in the
>>>>> release.
>>>>>> Note that we don't even expose/leak the protoc-generated C++ classes
>>>> for
>>>>>> similar reasons.
>>>>>>
>>>>>> Also, as Weston already pointed out, it's not really our job; Substrait
>>>>>> aims to publish bindings for various languages by itself. It just
>>>> doesn't
>>>>>> expose them for Python *yet*. In the interim I suppose you could use
>>>> the
>>>>>> substrait-validator package from PyPI. It does expose them, as well as
>>>>> some
>>>>>> convenient conversion functions, but I'm having trouble finding people
>>>> to
>>>>>> help me keep the validator maintained.
>>>>>>
>>>>>> I suppose versioning would be difficult either way, since Substrait
>>>>>> semi-regularly pushes breaking changes and Arrow currently lags behind
>>>> by
>>>>>> several months (though I have a PR open for Substrait 0.6). I guess
>>>> from
>>>>>> that point of view distributing the right version along with pyarrow
>>>>> seems
>>>>>> nice, but the issues of Google's protobuf implementation remain. This
>>>>> being
>>>>>> an issue at all is also very much a Substrait problem, not an Arrow
>>>>>> problem; at best we can try to mitigate it.
>>>>>>
>>>>>> Jeroen
>>>>>>
>>>>>> On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
>>>>>>
>>>>>>> I looked into the Arrow build system some more. It is possible to get
>>>>> the
>>>>>>> Python classes generated by adding "--python-out" flag (set to a
>>>>> directory
>>>>>>> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
>>>>>>> `macro(build_substrait)` in
>>>>> `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
>>>>>>> However, this makes them available only in the Arrow C++ build whereas
>>>>> for
>>>>>>> the current purpose they need to be available in the PyArrow build.
>>>> The
>>>>>>> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
>>>> has
>>>>>>> access to `cpp/cmake_modules`. So, one solution could be to pull
>>>>>>> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
>>>>>>> generate the Python protobuf classes under `python/`, making them
>>>>> available
>>>>>>> for import by PyArrow code. This would probably be cleaner with some
>>>>> macro
>>>>>>> parameters to distinguish between C++ and Python generation.
>>>>>>>
>>>>>>> Does this sound like a reasonable approach?
>>>>>>>
>>>>>>>
>>>>>>> Yaron.
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Yaron Gvili <rt...@hotmail.com>
>>>>>>> Sent: Saturday, July 2, 2022 8:55 AM
>>>>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>>>>>>> cpcloud@gmail.com>
>>>>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>>>>
>>>>>>> I'm somewhat confused by this answer because I think resolving the
>>>>> issue I
>>>>>>> raised does not require any change outside PyArrow. I'll try to
>>>> explain
>>>>> the
>>>>>>> issue differently.
>>>>>>>
>>>>>>> First, let me describe the current situation with Substrait protobuf
>>>> in
>>>>>>> Arrow C++. The Arrow C++ build system handles external projects in
>>>>>>> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
>>>> external
>>>>>>> projects is "substrait". By default, the build system takes the source
>>>>> code
>>>>>>> for "substrait" from `
>>>>>>>
>>>>>
>>>> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
>>>>>>> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
>>>>>>> `cpp/thirdparty/versions.txt`. The source code is check-summed and
>>>>> unpacked
>>>>>>> in `substrait_ep-prefix` under the build directory and from this the
>>>>>>> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
>>>>>>> `substrait_ep-generated` under the build directory. The build system
>>>>> makes
>>>>>>> a library using the `*.cc` files and makes the `*.h` files available
>>>> for
>>>>>>> other C++ modules to use.
>>>>>>>
>>>>>>> Setting up the above mechanism did not require any change in the
>>>>>>> `substrait-io/substrait` repo, nor any coordination with its authors.
>>>>> What
>>>>>>> I'm looking for is a similar build mechanism for PyArrow that builds
>>>>>>> Substrait protobuf Python classes and makes them available for use by
>>>>> other
>>>>>>> PyArrow modules. I believe this PyArrow build mechanism does not exist
>>>>>>> currently and that setting up one would not require any changes
>>>> outside
>>>>>>> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
>>>>> others
>>>>>>> agree this mechanism is needed at least due to the problem I ran into
>>>>> that
>>>>>>> I previously described, and (3) for any thoughts about how to set up
>>>>> this
>>>>>>> mechanism assuming it is needed.
>>>>>>>
>>>>>>> Weston, perhaps your thinking was that the Substrait protobuf Python
>>>>>>> classes need to be built by a repo in the substrait-io space and made
>>>>>>> available as a binary+headers package? This can be done but will
>>>> require
>>>>>>> involving Substrait people and appears to be inconsistent with current
>>>>>>> patterns in the Arrow build system. Note that for my purposes here,
>>>> the
>>>>>>> Substrait protobuf Python classes will be used for composing or
>>>>>>> interpreting a Substrait plan, not for transforming it by an
>>>> optimizer,
>>>>>>> though a Python-based optimizer is a valid use case for them.
>>>>>>>
>>>>>>>
>>>>>>> Yaron.
>>>>>>> ________________________________
>>>>>>> From: Weston Pace <we...@gmail.com>
>>>>>>> Sent: Friday, July 1, 2022 12:42 PM
>>>>>>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>>>>>>> cpcloud@gmail.com>
>>>>>>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>>>>>>
>>>>>>> Given that Acero does not do any planner / optimizer type tasks I'm
>>>>>>> not sure you will find anything like this in arrow-cpp or pyarrow.
>>>>>>> What you are describing I sometimes refer to as "plan slicing and
>>>>>>> dicing".  I have wondered if we will someday need this in Acero but I
>>>>>>> fear it is a slippery slope between "a little bit of plan
>>>>>>> manipulation" and "a full blown planner" so I've shied away from it.
>>>>>>> My first spot to look would be a substrait-python repository which
>>>>>>> would belong here: https://github.com/substrait-io
>>>>>>>
>>>>>>> However, it does not appear that such a repository exists.  If you're
>>>>>>> willing to create one then a quick ask on the Substrait Slack instance
>>>>>>> should be enough to get the repository created.  Perhaps there is some
>>>>>>> genesis of this library in Ibis although I think Ibis would use its
>>>>>>> own representation for slicing and dicing and only use Substrait for
>>>>>>> serialization.
>>>>>>>
>>>>>>> Once that repository is created pyarrow could probably import it but
>>>>>>> unless this plan manipulation makes sense purely from a pyarrow
>>>>>>> perspective I would rather prefer that the user application import
>>>>>>> both pyarrow and substrait-python independently.
>>>>>>>
>>>>>>> Perhaps @Phillip Cloud or someone from the Ibis space might have some
>>>>>>> ideas on where this might be found.
>>>>>>>
>>>>>>> -Weston
>>>>>>>
>>>>>>> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there support for accessing Substrait protobuf Python classes
>>>> (such
>>>>>>> as Plan) from PyArrow? If not, how should such support be added? For
>>>>>>> example, should the PyArrow build system pull in the Substrait repo as
>>>>> an
>>>>>>> external project and build its protobuf Python classes, in a manner
>>>>> similar
>>>>>>> to how Arrow C++ does it?
>>>>>>>>
>>>>>>>> I'm pondering these questions after running into an issue with code
>>>> I'm
>>>>>>> writing under PyArrow that parses a Substrait plan represented as a
>>>>>>> dictionary. The current (and kind of shaky) parsing operation in this
>>>>> code
>>>>>>> uses json.dumps() on the dictionary, which results in a string that is
>>>>>>> passed to a Cython API that handles it using Arrow C++ code that has
>>>>> access
>>>>>>> to Substrait protobuf C++ classes. But when the Substrait plan
>>>> contains
>>>>> a
>>>>>>> bytes-type, json.dumps() no longer works and fails with "TypeError:
>>>>> Object
>>>>>>> of type bytes is not JSON serializable". A fix for this, and a better
>>>>> way
>>>>>>> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
>>>>>>> dictionary. However, this invocation requires a second argument,
>>>> namely
>>>>> a
>>>>>>> protobuf message instance to merge with. The class of this message
>>>>> (such as
>>>>>>> Plan) is a Substrait protobuf Python class, hence the need to access
>>>>> such
>>>>>>> classes from PyArrow.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>
>>>>>
>>>> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>>>>>>>>
>>>>>>>>
>>>>>>>> Yaron.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
> 

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
It looks like the main decision to make is whether accessing Substrait protobuf Python classes from PyArrow is what we want here, because it would require some effort to implement as discussed.

> These are two reasons to need the protobuf in python.  However, I still don't fully understand why this is needed for pyarrow.

For #1, the invocations of google.protobuf.json_format.ParseDict() are going to be in PyArrow unit tests of UDF functionality. As described earlier, I was able to avoid this invocation until I got a "TypeError: Object of type bytes is not JSON serializable" for a bytes-type field holding the UDF pickle.
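To make #1 concrete, here is a minimal stdlib-only sketch of the failure; the plan dictionary and its field names are invented stand-ins, and the ParseDict() fix is left as a comment because it needs the generated Plan class that this thread is about:

```python
import json

# Stand-in for a Substrait plan dictionary; the "udf_pickle" field holds
# raw bytes (e.g. a pickled UDF), which JSON cannot represent.
plan_dict = {"relations": [], "udf_pickle": b"\x80\x04K*."}

try:
    json.dumps(plan_dict)
except TypeError as e:
    print(e)  # → Object of type bytes is not JSON serializable

# With access to the generated classes, the dictionary could instead be
# merged directly into a protobuf message, bypassing JSON entirely:
#
#   from google.protobuf import json_format
#   plan = json_format.ParseDict(plan_dict, Plan())
```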

For #2, this is related to the more general preference for PyArrow when dealing with UDFs. Basically, it is easier for the Substrait-producer to pickle a Python-implemented UDF into a Substrait plan via PyArrow APIs and, similarly, for the Substrait-consumer to unpickle the UDF from the Substrait plan via PyArrow APIs.  Doing it via Arrow C++ APIs, using callbacks or by launching an embedded Python interpreter, is undesirable IMO. More specifically, the use case of #2 is for UDTs (user-defined tables), which are objects somewhat more complex than UDFs, yet are easier to deal with using Python code.
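A rough sketch of this producer/consumer flow, using only the standard library; the plan_stub dictionary and its field names are invented stand-ins for the actual Substrait extension fields:

```python
import pickle

def multiply_by_two(batch):
    # Stand-in for a Python-implemented UDF.
    return [x * 2 for x in batch]

# Producer side (via PyArrow APIs): pickle the UDF into a bytes-typed field.
plan_stub = {
    "extension_udfs": [
        {"name": "multiply_by_two", "payload": pickle.dumps(multiply_by_two)}
    ]
}

# Consumer side (via PyArrow APIs): unpickle and register before execution.
entry = plan_stub["extension_udfs"][0]
udf = pickle.loads(entry["payload"])
print(udf([1, 2, 3]))  # → [2, 4, 6]
```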

> In the first approach it would make sense that you would need Substrait access from python.

The first approach is basically the design I'm advocating, except that the registrations allow the replacing-step to be avoided, so the Substrait plan need not be modified. Indeed, this approach leads to a need for Substrait access from Python (and PyArrow, as explained above).

> In the second approach it wouldn't be needed.

Agreed. Though this approach has its disadvantages. For example, each new kind of user-defined object (like UDT) would require at least one new callback in Arrow C++ code. It would also be harder for the user to manually register UDFs outside the context of a Substrait plan, and I think also the Arrow Substrait APIs would need to be changed to accept a FunctionRegistry in order to support a nested registry; it's easier, and probably more flexible, to let PyArrow drive instead of getting called back.

> Is this something I might be able to infer from [1] (I haven't yet had a chance to look at that in detail)?

[1] is focused on UDFs, so it could give you an idea but won't show you the details of UDT integration.


Yaron.
________________________________
From: Weston Pace <we...@gmail.com>
Sent: Wednesday, July 6, 2022 1:30 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

> 1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
> 2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.

These are two reasons to need the protobuf in python.  However, I
still don't fully understand why this is needed for pyarrow.

I can maybe try and infer a little.  For example, is this required to
add support in pyarrow for running UDFs that are embedded in the
Substrait plan?

One possible implementation could be to:

 * Have pyarrow accept the Substrait plan
 * Extract the embedded UDFs
 * Register Arrow UDFs
 * Replace the calls to the UDF in the Substrait plan with calls to
the newly registered UDF
 * Submit the modified Substrait plan to Arrow-C++

However, another potential implementation could be to:

 * Define a "UDF provider" in C++.  This is a callback that receives a
buffer (the pickled UDF) and a name and also a function that creates
an Arrow "call" given the name of a UDF
 * Implement the UDF provider in pyarrow
    * The implementation would unpickle the UDF and register a UDF
with Arrow.  Then making the Arrow call would just be creating a call
into the registered UDF.
 * Continue to have Substrait plans go directly into Arrow-C++, but
now pyarrow is a "UDF provider" for "pickle UDFs"

In the first approach it would make sense that you would need
Substrait access from python.  In the second approach it wouldn't be
needed.  I'm not advocating for one approach vs the other but I am
trying to understand the overall goal with adding this support into
pyarrow.  However, a PR may be clearer.  Is this something I might be
able to infer from [1] (I haven't yet had a chance to look at that in
detail)?

> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?

TMK, that should be fine for a build dependency.

[1] https://github.com/apache/arrow/pull/13500

On Wed, Jul 6, 2022 at 5:07 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?
>
>
> Yaron.
> ________________________________
> From: Yaron Gvili <rt...@hotmail.com>
> Sent: Wednesday, July 6, 2022 7:26 AM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> @Weston: The need for accessing the Substrait protobuf Python classes is not for the purpose of modifying the Substrait plan. There are two relevant cases I went into:
>
>   1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
>   2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
>
> Yaron.
> ________________________________
> From: Weston Pace <we...@gmail.com>
> Sent: Tuesday, July 5, 2022 7:59 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> If this is to be used within pyarrow then I think it makes sense as
> something to do.  Can you expand a bit more on what you are trying to
> do?  Why do you need to modify the Substrait plan in pyarrow?
>
> > This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>
> I'm not quite certain why this requires a modification to the plan.
>
>
> On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > @Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
> >
> >
> > Yaron.
> >
> > ________________________________
> > From: Li Jin <ic...@gmail.com>
> > Sent: Tuesday, July 5, 2022, 4:22 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> > Yaron, do we need to parse the Substrait protobuf in Python so that we can
> > get the UDFs and register them with PyArrow?
> >
> > On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > > This rewriting of the package is basically what I had in mind; the `_ep`
> > > was just to signal a private package, which cannot be enforced, of course.
> > > Assuming this rewriting would indeed avoid conflict with any standard
> > > protobuf package a user might want to use, it would be nicer to do this
> > > using a robust refactoring
> > > tool. We could try Rope [1] (in particular, its folder refactoring library
> > > method [2]) for this; it should add Rope only as a build dependency. Does
> > > this sound worth trying?
> > >
> > > [1] https://github.com/python-rope/rope
> > > [2]
> > > https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
> > >
> > >
> > > Yaron.
> > > ________________________________
> > > From: Jeroen van Straten <je...@gmail.com>
> > > Sent: Monday, July 4, 2022 12:38 PM
> > > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >
> > > Ah, I see. I guess that complicates things a little.
> > >
> > > About the exposure part in C++, the idea is that if we statically link
> > > libprotobuf and use private linkage for the generated classes, *hopefully*
> > > users won't run into linking issues when they link to both Arrow and to
> > > some libprotobuf of their own, especially when they or another one of their
> > > dependencies also includes Substrait protobuf. The classes would absolutely
> > > conflict otherwise, and the libprotobuf version may or may not conflict.
> > > The latter is kind of an unsolved problem; AFAIK there have been talks to
> > > replace libprotobuf with a third-party alternative to avoid all this
> > > nonsense, but nothing concrete yet.
> > >
> > > For Python, the problems are not quite the same:
> > >
> > >  - There is no such thing as private linkage in Python, so we *have* to
Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Weston Pace <we...@gmail.com>.
> 1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
> 2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.

These are two reasons to need the protobuf classes in Python.  However,
I still don't fully understand why this is needed in pyarrow.

I can maybe try and infer a little.  For example, is this required to
add support in pyarrow for running UDFs that are embedded in the
Substrait plan?

One possible implementation could be to:

 * Have pyarrow accept the Substrait plan
 * Extract the embedded UDFs
 * Register Arrow UDFs
 * Replace the calls to the UDF in the Substrait plan with calls to
the newly registered UDF
 * Submit the modified Substrait plan to Arrow-C++

However, another potential implementation could be to:

 * Define a "UDF provider" in C++.  This is a callback that receives a
buffer (the pickled UDF) and a name, plus a function that creates an
Arrow "call" given the name of a UDF
 * Implement the UDF provider in pyarrow
    * The implementation would unpickle the UDF and register a UDF
with Arrow.  Then making the Arrow call would just be creating a call
into the registered UDF.
 * Continue to have Substrait plans go directly into Arrow-C++, but
now pyarrow is a "UDF provider" for "pickle UDFs"
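
The "UDF provider" idea can be sketched in a few lines of Python. This is purely illustrative: `REGISTRY` and `udf_provider` are hypothetical names standing in for whatever pyarrow would actually expose, not existing API.

```python
import pickle

# Hypothetical in-process registry standing in for Arrow's UDF registry.
REGISTRY = {}

def inc(x):
    # Example UDF: a plain Python function that would travel through the
    # Substrait plan as an opaque pickled payload.
    return x + 1

def udf_provider(name, payload):
    """Unpickle a "pickle UDF" payload and register it under `name`.

    In the second approach above, Arrow-C++ would invoke this callback
    while consuming the Substrait plan, then build a "call" against the
    registered name instead of interpreting the payload itself.
    """
    REGISTRY[name] = pickle.loads(payload)
    return name

registered = udf_provider("my_udf", pickle.dumps(inc))
print(REGISTRY[registered](41))  # -> 42
```

With this split, pyarrow never needs the Substrait protobuf Python classes itself; it only ever sees the payload bytes handed over by C++.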

In the first approach it would make sense that you would need
Substrait access from python.  In the second approach it wouldn't be
needed.  I'm not advocating for one approach vs the other but I am
trying to understand the overall goal with adding this support into
pyarrow.  However, a PR may be clearer.  Is this something I might be
able to infer from [1] (I haven't yet had a chance to look at that in
detail)?

> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?

TMK, that should be fine for a build dependency.

[1] https://github.com/apache/arrow/pull/13500

On Wed, Jul 6, 2022 at 5:07 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?
>
>
> Yaron.
> ________________________________
> From: Yaron Gvili <rt...@hotmail.com>
> Sent: Wednesday, July 6, 2022 7:26 AM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> @Weston: The need for accessing the Substrait protobuf Python classes is not for the purpose of modifying the Substrait plan. There are two relevant cases I went into:
>
>   1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
>   2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. It's just read-only access to the field from Python, but still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
>
> Yaron.
> ________________________________
> From: Weston Pace <we...@gmail.com>
> Sent: Tuesday, July 5, 2022 7:59 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> If this is to be used within pyarrow then I think it makes sense as
> something to do.  Can you expand a bit more on what you are trying to
> do?  Why do you need to modify the Substrait plan in pyarrow?
>
> > This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>
> I'm not quite certain why this requires a modification to the plan.
>
>
> On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > @Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
> >
> >
> > Yaron.
> >
> > ________________________________
> > From: Li Jin <ic...@gmail.com>
> > Sent: Tuesday, July 5, 2022, 4:22 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> > Yaron, do we need to parse the subtrait protobuf in Python so that we can
> > get the UDFs and register them with Pyarrow?
> >
> > On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > > This rewriting of the package is basically what I had in mind; the `_ep`
> > > was just to signal a private package, which cannot be enforced, of course.
> > > Assuming this rewriting would indeed avoid conflict with any standard
> > > protobuf package a user might want to use, it would be nicer to do this
> > > using a robust refactoring
> > > tool. We could try Rope [1] (in particular, its folder refactoring library
> > > method [2]) for this; it should add Rope only as a build dependency. Does
> > > this sound worth trying?
> > >
> > > [1] https://github.com/python-rope/rope
> > > [2]
> > > https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
> > >
> > >
> > > Yaron.
> > > ________________________________
> > > From: Jeroen van Straten <je...@gmail.com>
> > > Sent: Monday, July 4, 2022 12:38 PM
> > > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >
> > > Ah, I see. I guess that complicates things a little.
> > >
> > > About the exposure part in C++, the idea is that if we statically link
> > > libprotobuf and use private linkage for the generated classes, *hopefully*
> > > users won't run into linking issues when they link to both Arrow and to
> > > some libprotobuf of their own, especially when they or another one of their
> > > dependencies also includes Substrait protobuf. The classes would absolutely
> > > conflict otherwise, and the libprotobuf version may or may not conflict.
> > > The latter is kind of an unsolved problem; AFAIK there have been talks to
> > > replace libprotobuf with a third-party alternative to avoid all this
> > > nonsense, but nothing concrete yet.
> > >
> > > For Python, the problems are not quite the same:
> > >
> > >  - There is no such thing as private linkage in Python, so we *have* to
> > > re-namespace the generated Python files. Note that the output of protoc
> > > assumes that the files are in exactly the same namespace/module
> > > path/whatever as the proto files specify, so the toplevel module would be
> > > "substrait", not "arrow". See
> > > https://github.com/substrait-io/substrait/pull/207 and
> > >
> > > https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
> > > for my ugly workarounds for this thus far.
> > >  - There is also no such thing as static linkage in Python, so if you'd
> > > want to use Google's protobuf implementation in Python pyarrow *will* need
> > > to depend on it, and it'd become the user's problem to make sure that the
> > > installed Python protobuf version matches the version of their system
> > > library. It looks like pyarrow currently only depends on numpy, which is
> > > pretty awesome... so I feel like we should keep it that way.
> > >
> > > Not sure what the best course of action is.
> > >
> > > Jeroen
> > >
> > > On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
> > >
> > > > Thanks, the Google protobuf exposure concerns are clear. Another concern
> > > > would be consistent maintenance of the same version of Substrait protobuf
> > > > used in Arrow C++ and in PyArrow.
> > > >
> > > > I didn't mean access to users but internally. That is, I didn't mean to
> > > > expose Substrait protobuf Python classes to PyArrow users, just to use
> > > them
> > > > internally in PyArrow code I'll be developing, per the use case I
> > > described
> > > > at the start of this thread. IIUC, this should address the exposure
> > > > concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> > > > too. If the technical approach I just described would actually expose the
> > > > classes, what would be a proper way to avoid exposing them? Perhaps the
> > > > classes should be generated into a private package, e.g., under
> > > > `python/_ep`? (ep stands for external project)
> > > >
> > > >
> > > > Yaron.
> > > > ________________________________
> > > > From: Antoine Pitrou <an...@python.org>
> > > > Sent: Sunday, July 3, 2022 3:20 PM
> > > > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > > > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >
> > > >
> > > > I agree that giving direct access to protobuf classes is not Arrow's
> > > > job. You can probably take the upstream (i.e. Substrait's) protobuf
> > > > definitions and compile them yourself, using whatever settings required
> > > > by your project.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > > > > It's not so much about whether or not we *can*, but about whether or
> > > not
> > > > we
> > > > > *should* generate and expose these files.
> > > > >
> > > > > Fundamentally, Substrait aims to be a means to connect different
> > > systems
> > > > > together by standardizing some interchange format. Currently that
> > > happens
> > > > > to be based on protobuf, so *one* (obvious) way to generate and
> > > interpret
> > > > > Substrait plans is to use Google's own protobuf implementation, as
> > > > > generated by protoc for various languages. It's true that there's
> > > nothing
> > > > > fundamentally blocking Arrow from exposing those.
> > > > >
> > > > > However, it's by no means the *only* way to interact with protobuf
> > > > > serializations, and Google's own implementation is a hot mess when it
> > > > comes
> > > > > to dependencies; there are lots of good reasons why you might not want
> > > to
> > > > > depend on it and opt for a third-party implementation instead. For one
> > > > > thing, the Python wheel depends on system libraries beyond manylinux,
> > > and
> > > > > if they happen to be incompatible (which is likely) it just explodes
> > > > unless
> > > > > you set some environment variable. Therefore, assuming pyarrow doesn't
> > > > > already depend on protobuf, I feel like we should keep it that way, and
> > > > > thus that we should not include the generated Python files in the
> > > > release.
> > > > > Note that we don't even expose/leak the protoc-generated C++ classes
> > > for
> > > > > similar reasons.
> > > > >
> > > > > Also, as Weston already pointed out, it's not really our job; Substrait
> > > > > aims to publish bindings for various languages by itself. It just
> > > doesn't
> > > > > expose them for Python *yet*. In the interim I suppose you could use
> > > the
> > > > > substrait-validator package from PyPI. It does expose them, as well as
> > > > some
> > > > > convenient conversion functions, but I'm having trouble finding people
> > > to
> > > > > help me keep the validator maintained.
> > > > >
> > > > > I suppose versioning would be difficult either way, since Substrait
> > > > > semi-regularly pushes breaking changes and Arrow currently lags behind
> > > by
> > > > > several months (though I have a PR open for Substrait 0.6). I guess
> > > from
> > > > > that point of view distributing the right version along with pyarrow
> > > > seems
> > > > > nice, but the issues of Google's protobuf implementation remain. This
> > > > being
> > > > > an issue at all is also very much a Substrait problem, not an Arrow
> > > > > problem; at best we can try to mitigate it.
> > > > >
> > > > > Jeroen
> > > > >
> > > > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > > > >
> > > > >> I looked into the Arrow build system some more. It is possible to get
> > > > the
> > > > >> Python classes generated by adding "--python-out" flag (set to a
> > > > directory
> > > > >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> > > > >> `macro(build_substrait)` in
> > > > `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> > > > >> However, this makes them available only in the Arrow C++ build whereas
> > > > for
> > > > >> the current purpose they need to be available in the PyArrow build.
> > > The
> > > > >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
> > > has
> > > > >> access to `cpp/cmake_modules`. So, one solution could be to pull
> > > > >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> > > > >> generate the Python protobuf classes under `python/`, making them
> > > > available
> > > > >> for import by PyArrow code. This would probably be cleaner with some
> > > > macro
> > > > >> parameters to distinguish between C++ and Python generation.
> > > > >>
> > > > >> Does this sound like a reasonable approach?
> > > > >>
> > > > >>
> > > > >> Yaron.
> > > > >>
> > > > >> ________________________________
> > > > >> From: Yaron Gvili <rt...@hotmail.com>
> > > > >> Sent: Saturday, July 2, 2022 8:55 AM
> > > > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > > > >> cpcloud@gmail.com>
> > > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > > >>
> > > > >> I'm somewhat confused by this answer because I think resolving the
> > > > issue I
> > > > >> raised does not require any change outside PyArrow. I'll try to
> > > explain
> > > > the
> > > > >> issue differently.
> > > > >>
> > > > >> First, let me describe the current situation with Substrait protobuf
> > > in
> > > > >> Arrow C++. The Arrow C++ build system handles external projects in
> > > > >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
> > > external
> > > > >> projects is "substrait". By default, the build system takes the source
> > > > code
> > > > >> for "substrait" from `
> > > > >>
> > > >
> > > https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> > > > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> > > > >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> > > > unpacked
> > > > >> in `substrait_ep-prefix` under the build directory and from this the
> > > > >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> > > > >> `substrait_ep-generated` under the build directory. The build system
> > > > makes
> > > > >> a library using the `*.cc` files and makes the `*.h` files available
> > > for
> > > > >> other C++ modules to use.
> > > > >>
> > > > >> Setting up the above mechanism did not require any change in the
> > > > >> `substrait-io/substrait` repo, nor any coordination with its authors.
> > > > What
> > > > >> I'm looking for is a similar build mechanism for PyArrow that builds
> > > > >> Substrait protobuf Python classes and makes them available for use by
> > > > other
> > > > >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> > > > >> currently and that setting up one would not require any changes
> > > outside
> > > > >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> > > > others
> > > > >> agree this mechanism is needed at least due to the problem I ran into
> > > > that
> > > > >> I previously described, and (3) for any thoughts about how to set up
> > > > this
> > > > >> mechanism assuming it is needed.
> > > > >>
> > > > >> Weston, perhaps your thinking was that the Substrait protobuf Python
> > > > >> classes need to be built by a repo in the substrait-io space and made
> > > > >> available as a binary+headers package? This can be done but will
> > > require
> > > > >> involving Substrait people and appears to be inconsistent with current
> > > > >> patterns in the Arrow build system. Note that for my purposes here,
> > > the
> > > > >> Substrait protobuf Python classes will be used for composing or
> > > > >> interpreting a Substrait plan, not for transforming it by an
> > > optimizer,
> > > > >> though a Python-based optimizer is a valid use case for them.
> > > > >>
> > > > >>
> > > > >> Yaron.
> > > > >> ________________________________
> > > > >> From: Weston Pace <we...@gmail.com>
> > > > >> Sent: Friday, July 1, 2022 12:42 PM
> > > > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > > > >> cpcloud@gmail.com>
> > > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > > >>
> > > > >> Given that Acero does not do any planner / optimizer type tasks I'm
> > > > >> not sure you will find anything like this in arrow-cpp or pyarrow.
> > > > >> What you are describing I sometimes refer to as "plan slicing and
> > > > >> dicing".  I have wondered if we will someday need this in Acero but I
> > > > >> fear it is a slippery slope between "a little bit of plan
> > > > >> manipulation" and "a full blown planner" so I've shied away from it.
> > > > >> My first spot to look would be a substrait-python repository which
> > > > >> would belong here: https://github.com/substrait-io
> > > > >>
> > > > >> However, it does not appear that such a repository exists.  If you're
> > > > >> willing to create one then a quick ask on the Substrait Slack instance
> > > > >> should be enough to get the repository created.  Perhaps there is some
> > > > >> genesis of this library in Ibis although I think Ibis would use its
> > > > >> own representation for slicing and dicing and only use Substrait for
> > > > >> serialization.
> > > > >>
> > > > >> Once that repository is created pyarrow could probably import it but
> > > > >> unless this plan manipulation makes sense purely from a pyarrow
> > > > >> perspective I would rather prefer that the user application import
> > > > >> both pyarrow and substrait-python independently.
> > > > >>
> > > > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> > > > >> ideas on where this might be found.
> > > > >>
> > > > >> -Weston
> > > > >>
> > > > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
> > > wrote:
> > > > >>>
> > > > >>> Hi,
> > > > >>>
> > > > >>> Is there support for accessing Substrait protobuf Python classes
> > > (such
> > > > >> as Plan) from PyArrow? If not, how should such support be added? For
> > > > >> example, should the PyArrow build system pull in the Substrait repo as
> > > > an
> > > > >> external project and build its protobuf Python classes, in a manner
> > > > similar
> > > > >> to how Arrow C++ does it?
> > > > >>>
> > > > >>> I'm pondering these questions after running into an issue with code
> > > I'm
> > > > >> writing under PyArrow that parses a Substrait plan represented as a
> > > > >> dictionary. The current (and kind of shaky) parsing operation in this
> > > > code
> > > > >> uses json.dumps() on the dictionary, which results in a string that is
> > > > >> passed to a Cython API that handles it using Arrow C++ code that has
> > > > access
> > > > >> to Substrait protobuf C++ classes. But when the Substrait plan
> > > contains
> > > > a
> > > > >> bytes-type, json.dumps() no longer works and fails with "TypeError:
> > > > Object
> > > > >> of type bytes is not JSON serializable". A fix for this, and a better
> > > > way
> > > > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > > > >> dictionary. However, this invocation requires a second argument,
> > > namely
> > > > a
> > > > >> protobuf message instance to merge with. The class of this message
> > > > (such as
> > > > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > > > such
> > > > >> classes from PyArrow.
> > > > >>>
> > > > >>> [1]
> > > > >>
> > > >
> > > https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > > > >>>
> > > > >>>
> > > > >>> Yaron.
> > > > >>
> > > > >
> > > >
> > >
> >

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency?


Yaron.
________________________________
From: Yaron Gvili <rt...@hotmail.com>
Sent: Wednesday, July 6, 2022 7:26 AM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

@Weston: The need for accessing the Substrait protobuf Python classes is not for the purpose of modifying the Substrait plan. There are two relevant cases I went into:

  1.  Parsing a dictionary using google.protobuf.json_format.ParseDict(). This method requires an instance of a Substrait protobuf Python class, such as Plan, to merge with; hence, the need to access such classes, but no need to modify the Substrait plan. This case was described in the first message in this thread.
  2.  Accessing a Python object, which is more convenient to manipulate from Python, after unpickling it from a field in the Substrait plan. This is read-only access to the field from Python, but it still needs access to the Substrait protobuf Python classes. This case was mentioned in my previous message in this thread in response to Li's specific question about UDFs.
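The failure described in item 1 is easy to reproduce with the standard library alone. A minimal sketch (the dictionary contents are illustrative, and the `ParseDict()` fix in the trailing comment assumes generated Substrait bindings under a hypothetical module path that PyArrow does not currently ship):

```python
import json

# A Substrait-plan-like dictionary whose extension field carries raw
# bytes, e.g. a pickled Python callable (contents illustrative).
plan_dict = {"relations": [{"root": {"payload": b"\x80\x04K*."}}]}

try:
    json.dumps(plan_dict)
except TypeError as err:
    # json.dumps() cannot serialize bytes values
    print(err)  # prints: Object of type bytes is not JSON serializable

# With generated bindings available, the robust path would be roughly:
#   from google.protobuf import json_format
#   from substrait.plan_pb2 import Plan   # hypothetical module path
#   plan = json_format.ParseDict(plan_dict_with_base64_bytes, Plan())
```

Note that `ParseDict()` expects bytes fields to be base64-encoded strings per the protobuf JSON mapping, which is exactly the encoding step the raw `json.dumps()` approach cannot perform on its own.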

Yaron.
________________________________
From: Weston Pace <we...@gmail.com>
Sent: Tuesday, July 5, 2022 7:59 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

If this is to be used within pyarrow then I think it makes sense as
something to do.  Can you expand a bit more on what you are trying to
do?  Why do you need to modify the Substrait plan in pyarrow?

> This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.

I'm not quite certain why this requires a modification to the plan.
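For reference, the "callable returning a callable" shape quoted above can be sketched in plain Python, with lists standing in for Arrow record batches (all names and data here are illustrative):

```python
# Minimal sketch of a data-source "stream factory": calling the factory
# yields a stream callable, and calling the stream yields batches.
def make_source():
    batches = [[1, 2, 3], [4, 5, 6]]  # stand-ins for record batches

    def stream():
        yield from batches

    return stream


stream = make_source()   # first call: obtain the stream callable
print(list(stream()))    # second call: consume it
# prints: [[1, 2, 3], [4, 5, 6]]
```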


On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> @Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>
>
> Yaron.
>
> ________________________________
> From: Li Jin <ic...@gmail.com>
> Sent: Tuesday, July 5, 2022, 4:22 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Yaron, do we need to parse the Substrait protobuf in Python so that we can
> get the UDFs and register them with PyArrow?
>
> On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
>
> > This rewriting of the package is basically what I had in mind; the `_ep`
> > was just to signal a private package, which cannot be enforced, of course.
> > Assuming this rewriting would indeed avoid conflict with any standard
> > protobuf package a user might want to use, it would be nicer to do this
> > using a robust refactoring
> > tool. We could try Rope [1] (in particular, its folder refactoring library
> > method [2]) for this; it should add Rope only as a build dependency. Does
> > this sound worth trying?
> >
> > [1] https://github.com/python-rope/rope
> > [2]
> > https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
> >
> >
> > Yaron.
> > ________________________________
> > From: Jeroen van Straten <je...@gmail.com>
> > Sent: Monday, July 4, 2022 12:38 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> > Ah, I see. I guess that complicates things a little.
> >
> > About the exposure part in C++, the idea is that if we statically link
> > libprotobuf and use private linkage for the generated classes, *hopefully*
> > users won't run into linking issues when they link to both Arrow and to
> > some libprotobuf of their own, especially when they or another one of their
> > dependencies also includes Substrait protobuf. The classes would absolutely
> > conflict otherwise, and the libprotobuf version may or may not conflict.
> > The latter is kind of an unsolved problem; AFAIK there have been talks to
> > replace libprotobuf with a third-party alternative to avoid all this
> > nonsense, but nothing concrete yet.
> >
> > For Python, the problems are not quite the same:
> >
> >  - There is no such thing as private linkage in Python, so we *have* to
> > re-namespace the generated Python files. Note that the output of protoc
> > assumes that the files are in exactly the same namespace/module
> > path/whatever as the proto files specify, so the toplevel module would be
> > "substrait", not "arrow". See
> > https://github.com/substrait-io/substrait/pull/207 and
> >
> > https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
> > for my ugly workarounds for this thus far.
> >  - There is also no such thing as static linkage in Python, so if you'd
> > want to use Google's protobuf implementation in Python pyarrow *will* need
> > to depend on it, and it'd become the user's problem to make sure that the
> > installed Python protobuf version matches the version of their system
> > library. It looks like pyarrow currently only depends on numpy, which is
> > pretty awesome... so I feel like we should keep it that way.
> >
> > Not sure what the best course of action is.
> >
> > Jeroen
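The re-namespacing Jeroen describes amounts to rewriting the absolute imports that protoc emits in the generated files. A minimal sketch of the idea, assuming a private target package name (`pyarrow._substrait_ep` is an assumption, not an existing module):

```python
import re
from pathlib import Path


def renamespace(gen_dir, old_pkg="substrait", new_pkg="pyarrow._substrait_ep"):
    """Rewrite absolute imports in protoc-generated *_pb2*.py files so the
    modules live under a private package instead of the global one."""
    pattern = re.compile(
        rf"^(\s*)from {re.escape(old_pkg)}(\.[\w.]+)? import ", re.M)
    for pb_file in Path(gen_dir).rglob("*_pb2*.py"):
        src = pb_file.read_text()
        fixed = pattern.sub(
            lambda m: f"{m.group(1)}from {new_pkg}{m.group(2) or ''} import ",
            src)
        pb_file.write_text(fixed)
```

A refactoring tool such as Rope would do this more robustly (handling every import form, not just `from ... import`), but a regex pass of this shape is roughly what the substrait-validator build script does today.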
> >
> > On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > > Thanks, the Google protobuf exposure concerns are clear. Another concern
> > > would be consistent maintenance of the same version of Substrait protobuf
> > > used in Arrow C++ and in PyArrow.
> > >
> > > I didn't mean access to users but internally. That is, I didn't mean to
> > > expose Substrait protobuf Python classes to PyArrow users, just to use
> > them
> > > internally in PyArrow code I'll be developing, per the use case I
> > described
> > > at the start of this thread. IIUC, this should address the exposure
> > > concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> > > too. If the technical approach I just described would actually expose the
> > > classes, what would be a proper way to avoid exposing them? Perhaps the
> > > classes should be generated into a private package, e.g., under
> > > `python/_ep`? (ep stands for external project)
> > >
> > >
> > > Yaron.
> > > ________________________________
> > > From: Antoine Pitrou <an...@python.org>
> > > Sent: Sunday, July 3, 2022 3:20 PM
> > > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >
> > >
> > > I agree that giving direct access to protobuf classes is not Arrow's
> > > job. You can probably take the upstream (i.e. Substrait's) protobuf
> > > definitions and compile them yourself, using whatever settings required
> > > by your project.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > > > It's not so much about whether or not we *can*, but about whether or
> > not
> > > we
> > > > *should* generate and expose these files.
> > > >
> > > > Fundamentally, Substrait aims to be a means to connect different
> > systems
> > > > together by standardizing some interchange format. Currently that
> > happens
> > > > to be based on protobuf, so *one* (obvious) way to generate and
> > interpret
> > > > Substrait plans is to use Google's own protobuf implementation, as
> > > > generated by protoc for various languages. It's true that there's
> > nothing
> > > > fundamentally blocking Arrow from exposing those.
> > > >
> > > > However, it's by no means the *only* way to interact with protobuf
> > > > serializations, and Google's own implementation is a hot mess when it
> > > comes
> > > > to dependencies; there are lots of good reasons why you might not want
> > to
> > > > depend on it and opt for a third-party implementation instead. For one
> > > > thing, the Python wheel depends on system libraries beyond manylinux,
> > and
> > > > if they happen to be incompatible (which is likely) it just explodes
> > > unless
> > > > you set some environment variable. Therefore, assuming pyarrow doesn't
> > > > already depend on protobuf, I feel like we should keep it that way, and
> > > > thus that we should not include the generated Python files in the
> > > release.
> > > > Note that we don't even expose/leak the protoc-generated C++ classes
> > for
> > > > similar reasons.
> > > >
> > > > Also, as Weston already pointed out, it's not really our job; Substrait
> > > > aims to publish bindings for various languages by itself. It just
> > doesn't
> > > > expose them for Python *yet*. In the interim I suppose you could use
> > the
> > > > substrait-validator package from PyPI. It does expose them, as well as
> > > some
> > > > convenient conversion functions, but I'm having trouble finding people
> > to
> > > > help me keep the validator maintained.
> > > >
> > > > I suppose versioning would be difficult either way, since Substrait
> > > > semi-regularly pushes breaking changes and Arrow currently lags behind
> > by
> > > > several months (though I have a PR open for Substrait 0.6). I guess
> > from
> > > > that point of view distributing the right version along with pyarrow
> > > seems
> > > > nice, but the issues of Google's protobuf implementation remain. This
> > > being
> > > > an issue at all is also very much a Substrait problem, not an Arrow
> > > > problem; at best we can try to mitigate it.
> > > >
> > > > Jeroen
> > > >
> > > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > > >
> > > >> I looked into the Arrow build system some more. It is possible to get
> > > the
> > > >> Python classes generated by adding the "--python_out" flag (set to a
> > > directory
> > > >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> > > >> `macro(build_substrait)` in
> > > `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> > > >> However, this makes them available only in the Arrow C++ build whereas
> > > for
> > > >> the current purpose they need to be available in the PyArrow build.
> > The
> > > >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
> > has
> > > >> access to `cpp/cmake_modules`. So, one solution could be to pull
> > > >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> > > >> generate the Python protobuf classes under `python/`, making them
> > > available
> > > >> for import by PyArrow code. This would probably be cleaner with some
> > > macro
> > > >> parameters to distinguish between C++ and Python generation.
> > > >>
> > > >> Does this sound like a reasonable approach?
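For illustration, the protoc invocation that the added flag would produce is along these lines. The helper below only assembles the command; the directory layout and proto file names are assumptions modeled on the Substrait repo, not verified paths:

```python
def protoc_cmd(proto_dir, out_dir, protos):
    # Assemble the protoc command line; --python_out is the Python
    # counterpart of the C++ generation macro(build_substrait) already does.
    return ["protoc", f"--proto_path={proto_dir}",
            f"--python_out={out_dir}", *protos]


cmd = protoc_cmd("substrait_ep-prefix/proto",
                 "python/pyarrow/_substrait_ep",
                 ["substrait/plan.proto", "substrait/algebra.proto"])
# subprocess.run(cmd, check=True)  # would invoke protoc (needs it on PATH)
print(cmd[0], cmd[2])  # prints: protoc --python_out=python/pyarrow/_substrait_ep
```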
> > > >>
> > > >>
> > > >> Yaron.
> > > >>
> > > >> ________________________________
> > > >> From: Yaron Gvili <rt...@hotmail.com>
> > > >> Sent: Saturday, July 2, 2022 8:55 AM
> > > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > > >> cpcloud@gmail.com>
> > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >>
> > > >> I'm somewhat confused by this answer because I think resolving the
> > > issue I
> > > >> raised does not require any change outside PyArrow. I'll try to
> > explain
> > > the
> > > >> issue differently.
> > > >>
> > > >> First, let me describe the current situation with Substrait protobuf
> > in
> > > >> Arrow C++. The Arrow C++ build system handles external projects in
> > > >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
> > external
> > > >> projects is "substrait". By default, the build system takes the source
> > > code
> > > >> for "substrait" from `
> > > >>
> > >
> > https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> > > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> > > >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> > > unpacked
> > > >> in `substrait_ep-prefix` under the build directory and from this the
> > > >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> > > >> `substrait_ep-generated` under the build directory. The build system
> > > makes
> > > >> a library using the `*.cc` files and makes the `*.h` files available
> > for
> > > >> other C++ modules to use.
> > > >>
> > > >> Setting up the above mechanism did not require any change in the
> > > >> `substrait-io/substrait` repo, nor any coordination with its authors.
> > > What
> > > >> I'm looking for is a similar build mechanism for PyArrow that builds
> > > >> Substrait protobuf Python classes and makes them available for use by
> > > other
> > > >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> > > >> currently and that setting up one would not require any changes
> > outside
> > > >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> > > others
> > > >> agree this mechanism is needed at least due to the problem I ran into
> > > that
> > > >> I previously described, and (3) for any thoughts about how to set up
> > > this
> > > >> mechanism assuming it is needed.
> > > >>
> > > >> Weston, perhaps your thinking was that the Substrait protobuf Python
> > > >> classes need to be built by a repo in the substrait-io space and made
> > > >> available as a binary+headers package? This can be done but will
> > require
> > > >> involving Substrait people and appears to be inconsistent with current
> > > >> patterns in the Arrow build system. Note that for my purposes here,
> > the
> > > >> Substrait protobuf Python classes will be used for composing or
> > > >> interpreting a Substrait plan, not for transforming it by an
> > optimizer,
> > > >> though a Python-based optimizer is a valid use case for them.
> > > >>
> > > >>
> > > >> Yaron.
> > > >> ________________________________
> > > >> From: Weston Pace <we...@gmail.com>
> > > >> Sent: Friday, July 1, 2022 12:42 PM
> > > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > > >> cpcloud@gmail.com>
> > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >>
> > > >> Given that Acero does not do any planner / optimizer type tasks I'm
> > > >> not sure you will find anything like this in arrow-cpp or pyarrow.
> > > >> What you are describing I sometimes refer to as "plan slicing and
> > > >> dicing".  I have wondered if we will someday need this in Acero but I
> > > >> fear it is a slippery slope between "a little bit of plan
> > > >> manipulation" and "a full blown planner" so I've shied away from it.
> > > >> My first spot to look would be a substrait-python repository which
> > > >> would belong here: https://github.com/substrait-io
> > > >>
> > > >> However, it does not appear that such a repository exists.  If you're
> > > >> willing to create one then a quick ask on the Substrait Slack instance
> > > >> should be enough to get the repository created.  Perhaps there is some
> > > >> genesis of this library in Ibis although I think Ibis would use its
> > > >> own representation for slicing and dicing and only use Substrait for
> > > >> serialization.
> > > >>
> > > >> Once that repository is created pyarrow could probably import it but
> > > >> unless this plan manipulation makes sense purely from a pyarrow
> > > >> perspective I would rather prefer that the user application import
> > > >> both pyarrow and substrait-python independently.
> > > >>
> > > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> > > >> ideas on where this might be found.
> > > >>
> > > >> -Weston
> > > >>
> > > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
> > wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> Is there support for accessing Substrait protobuf Python classes
> > (such
> > > >> as Plan) from PyArrow? If not, how should such support be added? For
> > > >> example, should the PyArrow build system pull in the Substrait repo as
> > > an
> > > >> external project and build its protobuf Python classes, in a manner
> > > similar
> > > >> to how Arrow C++ does it?
> > > >>>
> > > >>> I'm pondering these questions after running into an issue with code
> > I'm
> > > >> writing under PyArrow that parses a Substrait plan represented as a
> > > >> dictionary. The current (and kind of shaky) parsing operation in this
> > > code
> > > >> uses json.dumps() on the dictionary, which results in a string that is
> > > >> passed to a Cython API that handles it using Arrow C++ code that has
> > > access
> > > >> to Substrait protobuf C++ classes. But when the Substrait plan
> > contains
> > > a
> > > >> bytes-type, json.dumps() no longer works and fails with "TypeError:
> > > Object
> > > >> of type bytes is not JSON serializable". A fix for this, and a better
> > > way
> > > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > > >> dictionary. However, this invocation requires a second argument,
> > namely
> > > a
> > > >> protobuf message instance to merge with. The class of this message
> > > (such as
> > > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > > such
> > > >> classes from PyArrow.
> > > >>>
> > > >>> [1]
> > > >>
> > >
> > https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > > >>>
> > > >>>
> > > >>> Yaron.
> > > >>
> > > >
> > >
> >
>

> > > an
> > > >> external project and build its protobuf Python classes, in a manner
> > > similar
> > > >> to how Arrow C++ does it?
> > > >>>
> > > >>> I'm pondering these questions after running into an issue with code
> > I'm
> > > >> writing under PyArrow that parses a Substrait plan represented as a
> > > >> dictionary. The current (and kind of shaky) parsing operation in this
> > > code
> > > >> uses json.dumps() on the dictionary, which results in a string that is
> > > >> passed to a Cython API that handles it using Arrow C++ code that has
> > > access
> > > >> to Substrait protobuf C++ classes. But when the Substrait plan
> > contains
> > > a
> > > >> bytes-type, json.dumps() no longer works and fails with "TypeError:
> > > Object
> > > >> of type bytes is not JSON serializable". A fix for this, and a better
> > > way
> > > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > > >> dictionary. However, this invocation requires a second argument,
> > namely
> > > a
> > > >> protobuf message instance to merge with. The class of this message
> > > (such as
> > > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > > such
> > > >> classes from PyArrow.
> > > >>>
> > > >>> [1]
> > > >>
> > >
> > https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > > >>>
> > > >>>
> > > >>> Yaron.
> > > >>
> > > >
> > >
> >
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Weston Pace <we...@gmail.com>.
If this is to be used within pyarrow then I think it makes sense to do.
Can you expand a bit more on what you are trying to accomplish?  Why do
you need to modify the Substrait plan in pyarrow?

> This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.

I'm not quite certain why this requires a modification to the plan.
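
[For context, the failure mode from the original message is easy to
reproduce in plain Python, and a bytes-aware JSON encoder (mirroring
protobuf's JSON mapping, which represents bytes fields as base64
strings) is one possible stopgap. The field names below are made up for
illustration and are not part of the Substrait schema:]

```python
import base64
import json

# A toy "plan" dictionary containing a bytes value, as in the original report.
plan_dict = {"extension": {"payload": b"\x00\x01\x02"}}

try:
    json.dumps(plan_dict)
except TypeError as err:
    print(err)  # Object of type bytes is not JSON serializable


class BytesAsBase64(json.JSONEncoder):
    """Encode bytes as base64 strings, matching protobuf's JSON mapping."""

    def default(self, obj):
        if isinstance(obj, (bytes, bytearray)):
            return base64.b64encode(obj).decode("ascii")
        return super().default(obj)


encoded = json.dumps(plan_dict, cls=BytesAsBase64)
print(encoded)  # {"extension": {"payload": "AAEC"}}
```

[That said, as the thread notes, calling
google.protobuf.json_format.ParseDict() on the dictionary, merged into a
generated Plan message, is the more robust direction, since it uses the
protobuf schema to decide each field's type rather than guessing.]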


On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> @Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
>
>
> Yaron.
>
> ________________________________
> From: Li Jin <ic...@gmail.com>
> Sent: Tuesday, July 5, 2022, 4:22 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Yaron, do we need to parse the Substrait protobuf in Python so that we can
> get the UDFs and register them with PyArrow?
>
> On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:
>
> > This rewriting of the package is basically what I had in mind; the `_ep`
> > was just to signal a private package, which cannot be enforced, of course.
> > Assuming this rewriting would indeed avoid conflict with any standard
> > protobuf package a user might want to use, it would be nicer to do this
> > using a robust refactoring
> > tool. We could try Rope [1] (in particular, its folder refactoring library
> > method [2]) for this; it should add Rope only as a build dependency. Does
> > this sound worth trying?
> >
> > [1] https://github.com/python-rope/rope
> > [2]
> > https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
> >
> >
> > Yaron.
> > ________________________________
> > From: Jeroen van Straten <je...@gmail.com>
> > Sent: Monday, July 4, 2022 12:38 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> > Ah, I see. I guess that complicates things a little.
> >
> > About the exposure part in C++, the idea is that if we statically link
> > libprotobuf and use private linkage for the generated classes, *hopefully*
> > users won't run into linking issues when they link to both Arrow and to
> > some libprotobuf of their own, especially when they or another one of their
> > dependencies also includes Substrait protobuf. The classes would absolutely
> > conflict otherwise, and the libprotobuf version may or may not conflict.
> > The latter is kind of an unsolved problem; AFAIK there have been talks to
> > replace libprotobuf with a third-party alternative to avoid all this
> > nonsense, but nothing concrete yet.
> >
> > For Python, the problems are not quite the same:
> >
> >  - There is no such thing as private linkage in Python, so we *have* to
> > re-namespace the generated Python files. Note that the output of protoc
> > assumes that the files are in exactly the same namespace/module
> > path/whatever as the proto files specify, so the toplevel module would be
> > "substrait", not "arrow". See
> > https://github.com/substrait-io/substrait/pull/207 and
> >
> > https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
> > for my ugly workarounds for this thus far.
> >  - There is also no such thing as static linkage in Python, so if you'd
> > want to use Google's protobuf implementation in Python pyarrow *will* need
> > to depend on it, and it'd become the user's problem to make sure that the
> > installed Python protobuf version matches the version of their system
> > library. It looks like pyarrow currently only depends on numpy, which is
> > pretty awesome... so I feel like we should keep it that way.
> >
> > Not sure what the best course of action is.
> >
> > Jeroen
> >
> > On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > > Thanks, the Google protobuf exposure concerns are clear. Another concern
> > > would be consistent maintenance of the same version of Substrait protobuf
> > > used in Arrow C++ and in PyArrow.
> > >
> > > I didn't mean access to users but internally. That is, I didn't mean to
> > > expose Substrait protobuf Python classes to PyArrow users, just to use
> > them
> > > internally in PyArrow code I'll be developing, per the use case I
> > described
> > > at the start of this thread. IIUC, this should address the exposure
> > > concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> > > too. If the technical approach I just described would actually expose the
> > > classes, what would be a proper way to avoid exposing them? Perhaps the
> > > classes should be generated into a private package, e.g., under
> > > `python/_ep`? (ep stands for external project)
> > >
> > >
> > > Yaron.
> > > ________________________________
> > > From: Antoine Pitrou <an...@python.org>
> > > Sent: Sunday, July 3, 2022 3:20 PM
> > > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >
> > >
> > > I agree that giving direct access to protobuf classes is not Arrow's
> > > job. You can probably take the upstream (i.e. Substrait's) protobuf
> > > definitions and compile them yourself, using whatever settings required
> > > by your project.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > > > It's not so much about whether or not we *can*, but about whether or
> > not
> > > we
> > > > *should* generate and expose these files.
> > > >
> > > > Fundamentally, Substrait aims to be a means to connect different
> > systems
> > > > together by standardizing some interchange format. Currently that
> > happens
> > > > to be based on protobuf, so *one* (obvious) way to generate and
> > interpret
> > > > Substrait plans is to use Google's own protobuf implementation, as
> > > > generated by protoc for various languages. It's true that there's
> > nothing
> > > > fundamentally blocking Arrow from exposing those.
> > > >
> > > > However, it's by no means the *only* way to interact with protobuf
> > > > serializations, and Google's own implementation is a hot mess when it
> > > comes
> > > > to dependencies; there are lots of good reasons why you might not want
> > to
> > > > depend on it and opt for a third-party implementation instead. For one
> > > > thing, the Python wheel depends on system libraries beyond manylinux,
> > and
> > > > if they happen to be incompatible (which is likely) it just explodes
> > > unless
> > > > you set some environment variable. Therefore, assuming pyarrow doesn't
> > > > already depend on protobuf, I feel like we should keep it that way, and
> > > > thus that we should not include the generated Python files in the
> > > release.
> > > > Note that we don't even expose/leak the protoc-generated C++ classes
> > for
> > > > similar reasons.
> > > >
> > > > Also, as Weston already pointed out, it's not really our job; Substrait
> > > > aims to publish bindings for various languages by itself. It just
> > doesn't
> > > > expose them for Python *yet*. In the interim I suppose you could use
> > the
> > > > substrait-validator package from PyPI. It does expose them, as well as
> > > some
> > > > convenient conversion functions, but I'm having trouble finding people
> > to
> > > > help me keep the validator maintained.
> > > >
> > > > I suppose versioning would be difficult either way, since Substrait
> > > > semi-regularly pushes breaking changes and Arrow currently lags behind
> > by
> > > > several months (though I have a PR open for Substrait 0.6). I guess
> > from
> > > > that point of view distributing the right version along with pyarrow
> > > seems
> > > > nice, but the issues of Google's protobuf implementation remain. This
> > > being
> > > > an issue at all is also very much a Substrait problem, not an Arrow
> > > > problem; at best we can try to mitigate it.
> > > >
> > > > Jeroen
> > > >
> > > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > > >
> > > >> I looked into the Arrow build system some more. It is possible to get
> > > the
> > > >> Python classes generated by adding "--python-out" flag (set to a
> > > directory
> > > >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> > > >> `macro(build_substrait)` in
> > > `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> > > >> However, this makes them available only in the Arrow C++ build whereas
> > > for
> > > >> the current purpose they need to be available in the PyArrow build.
> > The
> > > >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
> > has
> > > >> access to `cpp/cmake_modules`. So, one solution could be to pull
> > > >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> > > >> generate the Python protobuf classes under `python/`, making them
> > > available
> > > >> for import by PyArrow code. This would probably be cleaner with some
> > > macro
> > > >> parameters to distinguish between C++ and Python generation.
> > > >>
> > > >> Does this sound like a reasonable approach?
> > > >>
> > > >>
> > > >> Yaron.
> > > >>
> > > >> ________________________________
> > > >> From: Yaron Gvili <rt...@hotmail.com>
> > > >> Sent: Saturday, July 2, 2022 8:55 AM
> > > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > > >> cpcloud@gmail.com>
> > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >>
> > > >> I'm somewhat confused by this answer because I think resolving the
> > > issue I
> > > >> raised does not require any change outside PyArrow. I'll try to
> > explain
> > > the
> > > >> issue differently.
> > > >>
> > > >> First, let me describe the current situation with Substrait protobuf
> > in
> > > >> Arrow C++. The Arrow C++ build system handles external projects in
> > > >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
> > external
> > > >> projects is "substrait". By default, the build system takes the source
> > > code
> > > >> for "substrait" from `
> > > >>
> > >
> > https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> > > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> > > >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> > > unpacked
> > > >> in `substrait_ep-prefix` under the build directory and from this the
> > > >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> > > >> `substrait_ep-generated` under the build directory. The build system
> > > makes
> > > >> a library using the `*.cc` files and makes the `*.h` files available
> > for
> > > >> other C++ modules to use.
> > > >>
> > > >> Setting up the above mechanism did not require any change in the
> > > >> `substrait-io/substrait` repo, nor any coordination with its authors.
> > > What
> > > >> I'm looking for is a similar build mechanism for PyArrow that builds
> > > >> Substrait protobuf Python classes and makes them available for use by
> > > other
> > > >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> > > >> currently and that setting up one would not require any changes
> > outside
> > > >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> > > others
> > > >> agree this mechanism is needed at least due to the problem I ran into
> > > that
> > > >> I previously described, and (3) for any thoughts about how to set up
> > > this
> > > >> mechanism assuming it is needed.
> > > >>
> > > >> Weston, perhaps your thinking was that the Substrait protobuf Python
> > > >> classes need to be built by a repo in the substrait-io space and made
> > > >> available as a binary+headers package? This can be done but will
> > require
> > > >> involving Substrait people and appears to be inconsistent with current
> > > >> patterns in the Arrow build system. Note that for my purposes here,
> > the
> > > >> Substrait protobuf Python classes will be used for composing or
> > > >> interpreting a Substrait plan, not for transforming it by an
> > optimizer,
> > > >> though a Python-based optimizer is a valid use case for them.
> > > >>
> > > >>
> > > >> Yaron.
> > > >> ________________________________
> > > >> From: Weston Pace <we...@gmail.com>
> > > >> Sent: Friday, July 1, 2022 12:42 PM
> > > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > > >> cpcloud@gmail.com>
> > > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > > >>
> > > >> Given that Acero does not do any planner / optimizer type tasks I'm
> > > >> not sure you will find anything like this in arrow-cpp or pyarrow.
> > > >> What you are describing I sometimes refer to as "plan slicing and
> > > >> dicing".  I have wondered if we will someday need this in Acero but I
> > > >> fear it is a slippery slope between "a little bit of plan
> > > >> manipulation" and "a full blown planner" so I've shied away from it.
> > > >> My first spot to look would be a substrait-python repository which
> > > >> would belong here: https://github.com/substrait-io
> > > >>
> > > >> However, it does not appear that such a repository exists.  If you're
> > > >> willing to create one then a quick ask on the Substrait Slack instance
> > > >> should be enough to get the repository created.  Perhaps there is some
> > > >> genesis of this library in Ibis although I think Ibis would use its
> > > >> own representation for slicing and dicing and only use Substrait for
> > > >> serialization.
> > > >>
> > > >> Once that repository is created pyarrow could probably import it but
> > > >> unless this plan manipulation makes sense purely from a pyarrow
> > > >> perspective I would rather prefer that the user application import
> > > >> both pyarrow and substrait-python independently.
> > > >>
> > > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> > > >> ideas on where this might be found.
> > > >>
> > > >> -Weston
> > > >>
> > > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
> > wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> Is there support for accessing Substrait protobuf Python classes
> > (such
> > > >> as Plan) from PyArrow? If not, how should such support be added? For
> > > >> example, should the PyArrow build system pull in the Substrait repo as
> > > an
> > > >> external project and build its protobuf Python classes, in a manner
> > > similar
> > > >> to how Arrow C++ does it?
> > > >>>
> > > >>> I'm pondering these questions after running into an issue with code
> > I'm
> > > >> writing under PyArrow that parses a Substrait plan represented as a
> > > >> dictionary. The current (and kind of shaky) parsing operation in this
> > > code
> > > >> uses json.dumps() on the dictionary, which results in a string that is
> > > >> passed to a Cython API that handles it using Arrow C++ code that has
> > > access
> > > >> to Substrait protobuf C++ classes. But when the Substrait plan
> > contains
> > > a
> > > >> bytes-type, json.dumps() no longer works and fails with "TypeError:
> > > Object
> > > >> of type bytes is not JSON serializable". A fix for this, and a better
> > > way
> > > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > > >> dictionary. However, this invocation requires a second argument,
> > namely
> > > a
> > > >> protobuf message instance to merge with. The class of this message
> > > (such as
> > > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > > such
> > > >> classes from PyArrow.
> > > >>>
> > > >>> [1]
> > > >>
> > >
> > https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > > >>>
> > > >>>
> > > >>> Yaron.
> > > >>
> > > >
> > >
> >
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
@Li, yes though in a new way. This came up in a data-source UDF scenario where the implementation is a Python stream factory, i.e., a Python callable returning a callable. I think it would be easier to get the factory and then the stream using Python code than using Cython or C++.
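
[A minimal sketch of the "callable returning a callable" shape described
above; the names and the list-of-lists batch representation are purely
illustrative, not an actual PyArrow API:]

```python
# A data-source "stream factory": a callable that, when invoked, returns a
# stream callable yielding one batch per call, with None as the end marker.
def make_stream_factory(batches):
    def factory():
        it = iter(batches)

        def stream():
            return next(it, None)  # None signals end-of-stream

        return stream

    return factory


factory = make_stream_factory([[1, 2], [3, 4]])
stream = factory()           # the engine would call the factory once per scan
assert stream() == [1, 2]    # then pull batches until exhaustion
assert stream() == [3, 4]
assert stream() is None      # end-of-stream sentinel
```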


Yaron.

________________________________
From: Li Jin <ic...@gmail.com>
Sent: Tuesday, July 5, 2022, 4:22 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Yaron, do we need to parse the Substrait protobuf in Python so that we can
get the UDFs and register them with PyArrow?

On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:

> This rewriting of the package is basically what I had in mind; the `_ep`
> was just to signal a private package, which cannot be enforced, of course.
> Assuming this rewriting would indeed avoid conflict with any standard
> protobuf package a user might want to use, it would be nicer to do this
> using a robust refactoring
> tool. We could try Rope [1] (in particular, its folder refactoring library
> method [2]) for this; it should add Rope only as a build dependency. Does
> this sound worth trying?
>
> [1] https://github.com/python-rope/rope
> [2]
> https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
>
>
> Yaron.
> ________________________________
> From: Jeroen van Straten <je...@gmail.com>
> Sent: Monday, July 4, 2022 12:38 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Ah, I see. I guess that complicates things a little.
>
> About the exposure part in C++, the idea is that if we statically link
> libprotobuf and use private linkage for the generated classes, *hopefully*
> users won't run into linking issues when they link to both Arrow and to
> some libprotobuf of their own, especially when they or another one of their
> dependencies also includes Substrait protobuf. The classes would absolutely
> conflict otherwise, and the libprotobuf version may or may not conflict.
> The latter is kind of an unsolved problem; AFAIK there have been talks to
> replace libprotobuf with a third-party alternative to avoid all this
> nonsense, but nothing concrete yet.
>
> For Python, the problems are not quite the same:
>
>  - There is no such thing as private linkage in Python, so we *have* to
> re-namespace the generated Python files. Note that the output of protoc
> assumes that the files are in exactly the same namespace/module
> path/whatever as the proto files specify, so the toplevel module would be
> "substrait", not "arrow". See
> https://github.com/substrait-io/substrait/pull/207 and
>
> https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
> for my ugly workarounds for this thus far.
>  - There is also no such thing as static linkage in Python, so if you'd
> want to use Google's protobuf implementation in Python pyarrow *will* need
> to depend on it, and it'd become the user's problem to make sure that the
> installed Python protobuf version matches the version of their system
> library. It looks like pyarrow currently only depends on numpy, which is
> pretty awesome... so I feel like we should keep it that way.
>
> Not sure what the best course of action is.
>
> Jeroen
>
> On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
>
> > Thanks, the Google protobuf exposure concerns are clear. Another concern
> > would be consistent maintenance of the same version of Substrait protobuf
> > used in Arrow C++ and in PyArrow.
> >
> > I didn't mean access to users but internally. That is, I didn't mean to
> > expose Substrait protobuf Python classes to PyArrow users, just to use
> them
> > internally in PyArrow code I'll be developing, per the use case I
> described
> > at the start of this thread. IIUC, this should address the exposure
> > concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> > too. If the technical approach I just described would actually expose the
> > classes, what would be a proper way to avoid exposing them? Perhaps the
> > classes should be generated into a private package, e.g., under
> > `python/_ep`? (ep stands for external project)
> >
> >
> > Yaron.
> > ________________________________
> > From: Antoine Pitrou <an...@python.org>
> > Sent: Sunday, July 3, 2022 3:20 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> >
> > I agree that giving direct access to protobuf classes is not Arrow's
> > job. You can probably take the upstream (i.e. Substrait's) protobuf
> > definitions and compile them yourself, using whatever settings required
> > by your project.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > > It's not so much about whether or not we *can*, but about whether or
> not
> > we
> > > *should* generate and expose these files.
> > >
> > > Fundamentally, Substrait aims to be a means to connect different
> systems
> > > together by standardizing some interchange format. Currently that
> happens
> > > to be based on protobuf, so *one* (obvious) way to generate and
> interpret
> > > Substrait plans is to use Google's own protobuf implementation, as
> > > generated by protoc for various languages. It's true that there's
> nothing
> > > fundamentally blocking Arrow from exposing those.
> > >
> > > However, it's by no means the *only* way to interact with protobuf
> > > serializations, and Google's own implementation is a hot mess when it
> > comes
> > > to dependencies; there are lots of good reasons why you might not want
> to
> > > depend on it and opt for a third-party implementation instead. For one
> > > thing, the Python wheel depends on system libraries beyond manylinux,
> and
> > > if they happen to be incompatible (which is likely) it just explodes
> > unless
> > > you set some environment variable. Therefore, assuming pyarrow doesn't
> > > already depend on protobuf, I feel like we should keep it that way, and
> > > thus that we should not include the generated Python files in the
> > release.
> > > Note that we don't even expose/leak the protoc-generated C++ classes
> for
> > > similar reasons.
> > >
> > > Also, as Weston already pointed out, it's not really our job; Substrait
> > > aims to publish bindings for various languages by itself. It just
> doesn't
> > > expose them for Python *yet*. In the interim I suppose you could use
> the
> > > substrait-validator package from PyPI. It does expose them, as well as
> > some
> > > convenient conversion functions, but I'm having trouble finding people
> to
> > > help me keep the validator maintained.
> > >
> > > I suppose versioning would be difficult either way, since Substrait
> > > semi-regularly pushes breaking changes and Arrow currently lags behind
> by
> > > several months (though I have a PR open for Substrait 0.6). I guess
> from
> > > that point of view distributing the right version along with pyarrow
> > seems
> > > nice, but the issues of Google's protobuf implementation remain. This
> > being
> > > an issue at all is also very much a Substrait problem, not an Arrow
> > > problem; at best we can try to mitigate it.
> > >
> > > Jeroen
> > >
> > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > >
> > >> I looked into the Arrow build system some more. It is possible to get
> > the
> > >> Python classes generated by adding "--python-out" flag (set to a
> > directory
> > >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> > >> `macro(build_substrait)` in
> > `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> > >> However, this makes them available only in the Arrow C++ build whereas
> > for
> > >> the current purpose they need to be available in the PyArrow build.
> The
> > >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
> has
> > >> access to `cpp/cmake_modules`. So, one solution could be to pull
> > >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> > >> generate the Python protobuf classes under `python/`, making them
> > available
> > >> for import by PyArrow code. This would probably be cleaner with some
> > macro
> > >> parameters to distinguish between C++ and Python generation.
> > >>
> > >> Does this sound like a reasonable approach?
> > >>
> > >>
> > >> Yaron.
> > >>
> > >> ________________________________
> > >> From: Yaron Gvili <rt...@hotmail.com>
> > >> Sent: Saturday, July 2, 2022 8:55 AM
> > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > >> cpcloud@gmail.com>
> > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >>
> > >> I'm somewhat confused by this answer because I think resolving the
> > issue I
> > >> raised does not require any change outside PyArrow. I'll try to
> explain
> > the
> > >> issue differently.
> > >>
> > >> First, let me describe the current situation with Substrait protobuf
> in
> > >> Arrow C++. The Arrow C++ build system handles external projects in
> > >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
> external
> > >> projects is "substrait". By default, the build system takes the source
> > code
> > >> for "substrait" from `
> > >>
> >
> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> > >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> > unpacked
> > >> in `substrait_ep-prefix` under the build directory and from this the
> > >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> > >> `substrait_ep-generated` under the build directory. The build system
> > makes
> > >> a library using the `*.cc` files and makes the `*.h` files available
> for
> > >> other C++ modules to use.
> > >>
> > >> Setting up the above mechanism did not require any change in the
> > >> `substrait-io/substrait` repo, nor any coordination with its authors.
> > What
> > >> I'm looking for is a similar build mechanism for PyArrow that builds
> > >> Substrait protobuf Python classes and makes them available for use by
> > other
> > >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> > >> currently and that setting up one would not require any changes
> outside
> > >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> > others
> > >> agree this mechanism is needed at least due to the problem I ran into
> > that
> > >> I previously described, and (3) for any thoughts about how to set up
> > this
> > >> mechanism assuming it is needed.
> > >>
> > >> Weston, perhaps your thinking was that the Substrait protobuf Python
> > >> classes need to be built by a repo in the substrait-io space and made
> > >> available as a binary+headers package? This can be done but will
> require
> > >> involving Substrait people and appears to be inconsistent with current
> > >> patterns in the Arrow build system. Note that for my purposes here,
> the
> > >> Substrait protobuf Python classes will be used for composing or
> > >> interpreting a Substrait plan, not for transforming it by an
> optimizer,
> > >> though a Python-based optimizer is a valid use case for them.
> > >>
> > >>
> > >> Yaron.
> > >> ________________________________
> > >> From: Weston Pace <we...@gmail.com>
> > >> Sent: Friday, July 1, 2022 12:42 PM
> > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > >> cpcloud@gmail.com>
> > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >>
> > >> Given that Acero does not do any planner / optimizer type tasks I'm
> > >> not sure you will find anything like this in arrow-cpp or pyarrow.
> > >> What you are describing I sometimes refer to as "plan slicing and
> > >> dicing".  I have wondered if we will someday need this in Acero but I
> > >> fear it is a slippery slope between "a little bit of plan
> > >> manipulation" and "a full blown planner" so I've shied away from it.
> > >> My first spot to look would be a substrait-python repository which
> > >> would belong here: https://github.com/substrait-io
> > >>
> > >> However, it does not appear that such a repository exists.  If you're
> > >> willing to create one then a quick ask on the Substrait Slack instance
> > >> should be enough to get the repository created.  Perhaps there is some
> > >> genesis of this library in Ibis although I think Ibis would use its
> > >> own representation for slicing and dicing and only use Substrait for
> > >> serialization.
> > >>
> > >> Once that repository is created pyarrow could probably import it but
> > >> unless this plan manipulation makes sense purely from a pyarrow
> > >> perspective I would rather prefer that the user application import
> > >> both pyarrow and substrait-python independently.
> > >>
> > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> > >> ideas on where this might be found.
> > >>
> > >> -Weston
> > >>
> > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Is there support for accessing Substrait protobuf Python classes
> (such
> > >> as Plan) from PyArrow? If not, how should such support be added? For
> > >> example, should the PyArrow build system pull in the Substrait repo as
> > an
> > >> external project and build its protobuf Python classes, in a manner
> > similar
> > >> to how Arrow C++ does it?
> > >>>
> > >>> I'm pondering these questions after running into an issue with code
> I'm
> > >> writing under PyArrow that parses a Substrait plan represented as a
> > >> dictionary. The current (and kind of shaky) parsing operation in this
> > code
> > >> uses json.dumps() on the dictionary, which results in a string that is
> > >> passed to a Cython API that handles it using Arrow C++ code that has
> > access
> > >> to Substrait protobuf C++ classes. But when the Substrait plan
> contains
> > a
> > >> bytes-type, json.dump() no longer works and fails with "TypeError:
> > Object
> > >> of type bytes is not JSON serializable". A fix for this, and a better
> > way
> > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > >> dictionary. However, this invocation requires a second argument,
> namely
> > a
> > >> protobuf message instance to merge with. The class of this message
> > (such as
> > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > such
> > >> classes from PyArrow.
> > >>>
> > >>> [1]
> > >>
> >
> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > >>>
> > >>>
> > >>> Yaron.
> > >>
> > >
> >
>
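[Editor's note: the bytes-serialization failure described in the quoted message above is easy to reproduce with the standard library alone. The dictionary below is a made-up stand-in for a real Substrait plan; only the failure mode is the point.]

```python
import json

# Made-up stand-in for a Substrait plan dict that contains a bytes value,
# e.g. an opaque extension payload. Real plans are much larger.
plan_dict = {"relations": [], "payload": b"\x12\x34"}

try:
    json.dumps(plan_dict)
except TypeError as err:
    print(err)  # Object of type bytes is not JSON serializable
```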


Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Li Jin <ic...@gmail.com>.
Yaron, do we need to parse the Substrait protobuf in Python so that we can
get the UDFs and register them with PyArrow?
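[Editor's note: the ParseDict-based fix discussed below in the thread would look roughly like the following. This is a hypothetical sketch: the `substrait.plan_pb2` module is what protoc would generate from Substrait's plan.proto, and it is exactly the piece that PyArrow does not currently ship.]

```python
from google.protobuf import json_format

# Hypothetical import: "substrait.plan_pb2" is the module protoc would
# generate from plan.proto; it is not currently available from PyArrow.
from substrait.plan_pb2 import Plan

plan_dict = {"relations": []}

# Unlike the json.dumps() detour, ParseDict handles bytes-typed fields.
# The second argument is the protobuf message instance to merge into.
plan = json_format.ParseDict(plan_dict, Plan())
serialized = plan.SerializeToString()  # bytes, ready for a Cython API
```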

On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili <rt...@hotmail.com> wrote:

> This rewriting of the package is basically what I had in mind; the `_ep`
> was just to signal a private package, which cannot be enforced, of course.
> Assuming this rewriting would indeed avoid conflict with any standard
> protobuf package a user might want to use, it would be nicer to do this
> using a robust refactoring
> tool. We could try Rope [1] (in particular, its folder refactoring library
> method [2]) for this; it should add Rope only as a build dependency. Does
> this sound worth trying?
>
> [1] https://github.com/python-rope/rope
> [2]
> https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings
>
>
> Yaron.
> ________________________________
> From: Jeroen van Straten <je...@gmail.com>
> Sent: Monday, July 4, 2022 12:38 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Ah, I see. I guess that complicates things a little.
>
> About the exposure part in C++, the idea is that if we statically link
> libprotobuf and use private linkage for the generated classes, *hopefully*
> users won't run into linking issues when they link to both Arrow and to
> some libprotobuf of their own, especially when they or another one of their
> dependencies also includes Substrait protobuf. The classes would absolutely
> conflict otherwise, and the libprotobuf version may or may not conflict.
> The latter is kind of an unsolved problem; AFAIK there have been talks to
> replace libprotobuf with a third-party alternative to avoid all this
> nonsense, but nothing concrete yet.
>
> For Python, the problems are not quite the same:
>
>  - There is no such thing as private linkage in Python, so we *have* to
> re-namespace the generated Python files. Note that the output of protoc
> assumes that the files are in exactly the same namespace/module
> path/whatever as the proto files specify, so the toplevel module would be
> "substrait", not "arrow". See
> https://github.com/substrait-io/substrait/pull/207 and
>
> https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
> for my ugly workarounds for this thus far.
>  - There is also no such thing as static linkage in Python, so if you'd
> want to use Google's protobuf implementation in Python pyarrow *will* need
> to depend on it, and it'd become the user's problem to make sure that the
> installed Python protobuf version matches the version of their system
> library. It looks like pyarrow currently only depends on numpy, which is
> pretty awesome... so I feel like we should keep it that way.
>
> Not sure what the best course of action is.
>
> Jeroen
>
> On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:
>
> > Thanks, the Google protobuf exposure concerns are clear. Another concern
> > would be consistent maintenance of the same version of Substrait protobuf
> > used in Arrow C++ and in PyArrow.
> >
> > I didn't mean access to users but internally. That is, I didn't mean to
> > expose Substrait protobuf Python classes to PyArrow users, just to use
> them
> > internally in PyArrow code I'll be developing, per the use case I
> described
> > at the start of this thread. IIUC, this should address the exposure
> > concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> > too. If the technical approach I just described would actually expose the
> > classes, what would be a proper way to avoid exposing them? Perhaps the
> > classes should be generated into a private package, e.g., under
> > `python/_ep`? (ep stands for external project)
> >
> >
> > Yaron.
> > ________________________________
> > From: Antoine Pitrou <an...@python.org>
> > Sent: Sunday, July 3, 2022 3:20 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >
> >
> > I agree that giving direct access to protobuf classes is not Arrow's
> > job. You can probably take the upstream (i.e. Substrait's) protobuf
> > definitions and compile them yourself, using whatever settings required
> > by your project.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > > It's not so much about whether or not we *can*, but about whether or
> not
> > we
> > > *should* generate and expose these files.
> > >
> > > Fundamentally, Substrait aims to be a means to connect different
> systems
> > > together by standardizing some interchange format. Currently that
> happens
> > > to be based on protobuf, so *one* (obvious) way to generate and
> interpret
> > > Substrait plans is to use Google's own protobuf implementation, as
> > > generated by protoc for various languages. It's true that there's
> nothing
> > > fundamentally blocking Arrow from exposing those.
> > >
> > > However, it's by no means the *only* way to interact with protobuf
> > > serializations, and Google's own implementation is a hot mess when it
> > comes
> > > to dependencies; there are lots of good reasons why you might not want
> to
> > > depend on it and opt for a third-party implementation instead. For one
> > > thing, the Python wheel depends on system libraries beyond manylinux,
> and
> > > if they happen to be incompatible (which is likely) it just explodes
> > unless
> > > you set some environment variable. Therefore, assuming pyarrow doesn't
> > > already depend on protobuf, I feel like we should keep it that way, and
> > > thus that we should not include the generated Python files in the
> > release.
> > > Note that we don't even expose/leak the protoc-generated C++ classes
> for
> > > similar reasons.
> > >
> > > Also, as Weston already pointed out, it's not really our job; Substrait
> > > aims to publish bindings for various languages by itself. It just
> doesn't
> > > expose them for Python *yet*. In the interim I suppose you could use
> the
> > > substrait-validator package from PyPI. It does expose them, as well as
> > some
> > > convenient conversion functions, but I'm having trouble finding people
> to
> > > help me keep the validator maintained.
> > >
> > > I suppose versioning would be difficult either way, since Substrait
> > > semi-regularly pushes breaking changes and Arrow currently lags behind
> by
> > > several months (though I have a PR open for Substrait 0.6). I guess
> from
> > > that point of view distributing the right version along with pyarrow
> > seems
> > > nice, but the issues of Google's protobuf implementation remain. This
> > being
> > > an issue at all is also very much a Substrait problem, not an Arrow
> > > problem; at best we can try to mitigate it.
> > >
> > > Jeroen
> > >
> > > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> > >
> > >> I looked into the Arrow build system some more. It is possible to get
> > the
> > >> Python classes generated by adding "--python-out" flag (set to a
> > directory
> > >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> > >> `macro(build_substrait)` in
> > `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> > >> However, this makes them available only in the Arrow C++ build whereas
> > for
> > >> the current purpose they need to be available in the PyArrow build.
> The
> > >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS
> has
> > >> access to `cpp/cmake_modules`. So, one solution could be to pull
> > >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> > >> generate the Python protobuf classes under `python/`, making them
> > available
> > >> for import by PyArrow code. This would probably be cleaner with some
> > macro
> > >> parameters to distinguish between C++ and Python generation.
> > >>
> > >> Does this sound like a reasonable approach?
> > >>
> > >>
> > >> Yaron.
> > >>
> > >> ________________________________
> > >> From: Yaron Gvili <rt...@hotmail.com>
> > >> Sent: Saturday, July 2, 2022 8:55 AM
> > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > >> cpcloud@gmail.com>
> > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >>
> > >> I'm somewhat confused by this answer because I think resolving the
> > issue I
> > >> raised does not require any change outside PyArrow. I'll try to
> explain
> > the
> > >> issue differently.
> > >>
> > >> First, let me describe the current situation with Substrait protobuf
> in
> > >> Arrow C++. The Arrow C++ build system handles external projects in
> > >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these
> external
> > >> projects is "substrait". By default, the build system takes the source
> > code
> > >> for "substrait" from `
> > >>
> >
> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> > >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> > unpacked
> > >> in `substrait_ep-prefix` under the build directory and from this the
> > >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> > >> `substrait_ep-generated` under the build directory. The build system
> > makes
> > >> a library using the `*.cc` files and makes the `*.h` files available
> for
> > >> other C++ modules to use.
> > >>
> > >> Setting up the above mechanism did not require any change in the
> > >> `substrait-io/substrait` repo, nor any coordination with its authors.
> > What
> > >> I'm looking for is a similar build mechanism for PyArrow that builds
> > >> Substrait protobuf Python classes and makes them available for use by
> > other
> > >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> > >> currently and that setting up one would not require any changes
> outside
> > >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> > others
> > >> agree this mechanism is needed at least due to the problem I ran into
> > that
> > >> I previously described, and (3) for any thoughts about how to set up
> > this
> > >> mechanism assuming it is needed.
> > >>
> > >> Weston, perhaps your thinking was that the Substrait protobuf Python
> > >> classes need to be built by a repo in the substrait-io space and made
> > >> available as a binary+headers package? This can be done but will
> require
> > >> involving Substrait people and appears to be inconsistent with current
> > >> patterns in the Arrow build system. Note that for my purposes here,
> the
> > >> Substrait protobuf Python classes will be used for composing or
> > >> interpreting a Substrait plan, not for transforming it by an
> optimizer,
> > >> though a Python-based optimizer is a valid use case for them.
> > >>
> > >>
> > >> Yaron.
> > >> ________________________________
> > >> From: Weston Pace <we...@gmail.com>
> > >> Sent: Friday, July 1, 2022 12:42 PM
> > >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> > >> cpcloud@gmail.com>
> > >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> > >>
> > >> Given that Acero does not do any planner / optimizer type tasks I'm
> > >> not sure you will find anything like this in arrow-cpp or pyarrow.
> > >> What you are describing I sometimes refer to as "plan slicing and
> > >> dicing".  I have wondered if we will someday need this in Acero but I
> > >> fear it is a slippery slope between "a little bit of plan
> > >> manipulation" and "a full blown planner" so I've shied away from it.
> > >> My first spot to look would be a substrait-python repository which
> > >> would belong here: https://github.com/substrait-io
> > >>
> > >> However, it does not appear that such a repository exists.  If you're
> > >> willing to create one then a quick ask on the Substrait Slack instance
> > >> should be enough to get the repository created.  Perhaps there is some
> > >> genesis of this library in Ibis although I think Ibis would use its
> > >> own representation for slicing and dicing and only use Substrait for
> > >> serialization.
> > >>
> > >> Once that repository is created pyarrow could probably import it but
> > >> unless this plan manipulation makes sense purely from a pyarrow
> > >> perspective I would rather prefer that the user application import
> > >> both pyarrow and substrait-python independently.
> > >>
> > >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> > >> ideas on where this might be found.
> > >>
> > >> -Weston
> > >>
> > >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com>
> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Is there support for accessing Substrait protobuf Python classes
> (such
> > >> as Plan) from PyArrow? If not, how should such support be added? For
> > >> example, should the PyArrow build system pull in the Substrait repo as
> > an
> > >> external project and build its protobuf Python classes, in a manner
> > similar
> > >> to how Arrow C++ does it?
> > >>>
> > >>> I'm pondering these questions after running into an issue with code
> I'm
> > >> writing under PyArrow that parses a Substrait plan represented as a
> > >> dictionary. The current (and kind of shaky) parsing operation in this
> > code
> > >> uses json.dumps() on the dictionary, which results in a string that is
> > >> passed to a Cython API that handles it using Arrow C++ code that has
> > access
> > >> to Substrait protobuf C++ classes. But when the Substrait plan
> contains
> > a
> > >> bytes-type, json.dump() no longer works and fails with "TypeError:
> > Object
> > >> of type bytes is not JSON serializable". A fix for this, and a better
> > way
> > >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> > >> dictionary. However, this invocation requires a second argument,
> namely
> > a
> > >> protobuf message instance to merge with. The class of this message
> > (such as
> > >> Plan) is a Substrait protobuf Python class, hence the need to access
> > such
> > >> classes from PyArrow.
> > >>>
> > >>> [1]
> > >>
> >
> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> > >>>
> > >>>
> > >>> Yaron.
> > >>
> > >
> >
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
This rewriting of the package is basically what I had in mind; the `_ep` was just to signal a private package, which cannot be enforced, of course. Assuming this rewriting would indeed avoid conflict with any standard protobuf package a user might want to use, it would be nicer to do this using a robust refactoring
tool. We could try Rope [1] (in particular, its folder refactoring library method [2]) for this; it should add Rope only as a build dependency. Does this sound worth trying?

[1] https://github.com/python-rope/rope
[2] https://rope.readthedocs.io/en/latest/library.html#list-of-refactorings


Yaron.
________________________________
From: Jeroen van Straten <je...@gmail.com>
Sent: Monday, July 4, 2022 12:38 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Ah, I see. I guess that complicates things a little.

About the exposure part in C++, the idea is that if we statically link
libprotobuf and use private linkage for the generated classes, *hopefully*
users won't run into linking issues when they link to both Arrow and to
some libprotobuf of their own, especially when they or another one of their
dependencies also includes Substrait protobuf. The classes would absolutely
conflict otherwise, and the libprotobuf version may or may not conflict.
The latter is kind of an unsolved problem; AFAIK there have been talks to
replace libprotobuf with a third-party alternative to avoid all this
nonsense, but nothing concrete yet.

For Python, the problems are not quite the same:

 - There is no such thing as private linkage in Python, so we *have* to
re-namespace the generated Python files. Note that the output of protoc
assumes that the files are in exactly the same namespace/module
path/whatever as the proto files specify, so the toplevel module would be
"substrait", not "arrow". See
https://github.com/substrait-io/substrait/pull/207 and
https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
for my ugly workarounds for this thus far.
 - There is also no such thing as static linkage in Python, so if you'd
want to use Google's protobuf implementation in Python pyarrow *will* need
to depend on it, and it'd become the user's problem to make sure that the
installed Python protobuf version matches the version of their system
library. It looks like pyarrow currently only depends on numpy, which is
pretty awesome... so I feel like we should keep it that way.

Not sure what the best course of action is.

Jeroen

On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:

> Thanks, the Google protobuf exposure concerns are clear. Another concern
> would be consistent maintenance of the same version of Substrait protobuf
> used in Arrow C++ and in PyArrow.
>
> I didn't mean access to users but internally. That is, I didn't mean to
> expose Substrait protobuf Python classes to PyArrow users, just to use them
> internally in PyArrow code I'll be developing, per the use case I described
> at the start of this thread. IIUC, this should address the exposure
> concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> too. If the technical approach I just described would actually expose the
> classes, what would be a proper way to avoid exposing them? Perhaps the
> classes should be generated into a private package, e.g., under
> `python/_ep`? (ep stands for external project)
>
>
> Yaron.
> ________________________________
> From: Antoine Pitrou <an...@python.org>
> Sent: Sunday, July 3, 2022 3:20 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
>
> I agree that giving direct access to protobuf classes is not Arrow's
> job. You can probably take the upstream (i.e. Substrait's) protobuf
> definitions and compile them yourself, using whatever settings required
> by your project.
>
> Regards
>
> Antoine.
>
>
> Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > It's not so much about whether or not we *can*, but about whether or not
> we
> > *should* generate and expose these files.
> >
> > Fundamentally, Substrait aims to be a means to connect different systems
> > together by standardizing some interchange format. Currently that happens
> > to be based on protobuf, so *one* (obvious) way to generate and interpret
> > Substrait plans is to use Google's own protobuf implementation, as
> > generated by protoc for various languages. It's true that there's nothing
> > fundamentally blocking Arrow from exposing those.
> >
> > However, it's by no means the *only* way to interact with protobuf
> > serializations, and Google's own implementation is a hot mess when it
> comes
> > to dependencies; there are lots of good reasons why you might not want to
> > depend on it and opt for a third-party implementation instead. For one
> > thing, the Python wheel depends on system libraries beyond manylinux, and
> > if they happen to be incompatible (which is likely) it just explodes
> unless
> > you set some environment variable. Therefore, assuming pyarrow doesn't
> > already depend on protobuf, I feel like we should keep it that way, and
> > thus that we should not include the generated Python files in the
> release.
> > Note that we don't even expose/leak the protoc-generated C++ classes for
> > similar reasons.
> >
> > Also, as Weston already pointed out, it's not really our job; Substrait
> > aims to publish bindings for various languages by itself. It just doesn't
> > expose them for Python *yet*. In the interim I suppose you could use the
> > substrait-validator package from PyPI. It does expose them, as well as
> some
> > convenient conversion functions, but I'm having trouble finding people to
> > help me keep the validator maintained.
> >
> > I suppose versioning would be difficult either way, since Substrait
> > semi-regularly pushes breaking changes and Arrow currently lags behind by
> > several months (though I have a PR open for Substrait 0.6). I guess from
> > that point of view distributing the right version along with pyarrow
> seems
> > nice, but the issues of Google's protobuf implementation remain. This
> being
> > an issue at all is also very much a Substrait problem, not an Arrow
> > problem; at best we can try to mitigate it.
> >
> > Jeroen
> >
> > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> >
> >> I looked into the Arrow build system some more. It is possible to get
> the
> >> Python classes generated by adding "--python-out" flag (set to a
> directory
> >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> >> `macro(build_substrait)` in
> `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> >> However, this makes them available only in the Arrow C++ build whereas
> for
> >> the current purpose they need to be available in the PyArrow build. The
> >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has
> >> access to `cpp/cmake_modules`. So, one solution could be to pull
> >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> >> generate the Python protobuf classes under `python/`, making them
> available
> >> for import by PyArrow code. This would probably be cleaner with some
> macro
> >> parameters to distinguish between C++ and Python generation.
> >>
> >> Does this sound like a reasonable approach?
> >>
> >>
> >> Yaron.
> >>
> >> ________________________________
> >> From: Yaron Gvili <rt...@hotmail.com>
> >> Sent: Saturday, July 2, 2022 8:55 AM
> >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> >> cpcloud@gmail.com>
> >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >>
> >> I'm somewhat confused by this answer because I think resolving the
> issue I
> >> raised does not require any change outside PyArrow. I'll try to explain
> the
> >> issue differently.
> >>
> >> First, let me describe the current situation with Substrait protobuf in
> >> Arrow C++. The Arrow C++ build system handles external projects in
> >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external
> >> projects is "substrait". By default, the build system takes the source
> code
> >> for "substrait" from `
> >>
> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> unpacked
> >> in `substrait_ep-prefix` under the build directory and from this the
> >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> >> `substrait_ep-generated` under the build directory. The build system
> makes
> >> a library using the `*.cc` files and makes the `*.h` files available for
> >> other C++ modules to use.
> >>
> >> Setting up the above mechanism did not require any change in the
> >> `substrait-io/substrait` repo, nor any coordination with its authors.
> What
> >> I'm looking for is a similar build mechanism for PyArrow that builds
> >> Substrait protobuf Python classes and makes them available for use by
> other
> >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> >> currently and that setting up one would not require any changes outside
> >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> others
> >> agree this mechanism is needed at least due to the problem I ran into
> that
> >> I previously described, and (3) for any thoughts about how to set up
> this
> >> mechanism assuming it is needed.
> >>
> >> Weston, perhaps your thinking was that the Substrait protobuf Python
> >> classes need to be built by a repo in the substrait-io space and made
> >> available as a binary+headers package? This can be done but will require
> >> involving Substrait people and appears to be inconsistent with current
> >> patterns in the Arrow build system. Note that for my purposes here, the
> >> Substrait protobuf Python classes will be used for composing or
> >> interpreting a Substrait plan, not for transforming it by an optimizer,
> >> though a Python-based optimizer is a valid use case for them.
> >>
> >>
> >> Yaron.
> >> ________________________________
> >> From: Weston Pace <we...@gmail.com>
> >> Sent: Friday, July 1, 2022 12:42 PM
> >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> >> cpcloud@gmail.com>
> >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >>
> >> Given that Acero does not do any planner / optimizer type tasks I'm
> >> not sure you will find anything like this in arrow-cpp or pyarrow.
> >> What you are describing I sometimes refer to as "plan slicing and
> >> dicing".  I have wondered if we will someday need this in Acero but I
> >> fear it is a slippery slope between "a little bit of plan
> >> manipulation" and "a full blown planner" so I've shied away from it.
> >> My first spot to look would be a substrait-python repository which
> >> would belong here: https://github.com/substrait-io
> >>
> >> However, it does not appear that such a repository exists.  If you're
> >> willing to create one then a quick ask on the Substrait Slack instance
> >> should be enough to get the repository created.  Perhaps there is some
> >> genesis of this library in Ibis although I think Ibis would use its
> >> own representation for slicing and dicing and only use Substrait for
> >> serialization.
> >>
> >> Once that repository is created pyarrow could probably import it but
> >> unless this plan manipulation makes sense purely from a pyarrow
> >> perspective I would rather prefer that the user application import
> >> both pyarrow and substrait-python independently.
> >>
> >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> >> ideas on where this might be found.
> >>
> >> -Weston
> >>
> >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is there support for accessing Substrait protobuf Python classes (such
> >> as Plan) from PyArrow? If not, how should such support be added? For
> >> example, should the PyArrow build system pull in the Substrait repo as
> an
> >> external project and build its protobuf Python classes, in a manner
> similar
> >> to how Arrow C++ does it?
> >>>
> >>> I'm pondering these questions after running into an issue with code I'm
> >> writing under PyArrow that parses a Substrait plan represented as a
> >> dictionary. The current (and kind of shaky) parsing operation in this
> code
> >> uses json.dumps() on the dictionary, which results in a string that is
> >> passed to a Cython API that handles it using Arrow C++ code that has
> access
> >> to Substrait protobuf C++ classes. But when the Substrait plan contains
> a
> >> bytes-type, json.dump() no longer works and fails with "TypeError:
> Object
> >> of type bytes is not JSON serializable". A fix for this, and a better
> way
> >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> >> dictionary. However, this invocation requires a second argument, namely
> a
> >> protobuf message instance to merge with. The class of this message
> (such as
> >> Plan) is a Substrait protobuf Python class, hence the need to access
> such
> >> classes from PyArrow.
> >>>
> >>> [1]
> >>
> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> >>>
> >>>
> >>> Yaron.
> >>
> >
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Jeroen van Straten <je...@gmail.com>.
Ah, I see. I guess that complicates things a little.

About the exposure part in C++, the idea is that if we statically link
libprotobuf and use private linkage for the generated classes, *hopefully*
users won't run into linking issues when they link to both Arrow and to
some libprotobuf of their own, especially when they or another one of their
dependencies also includes Substrait protobuf. The classes would absolutely
conflict otherwise, and the libprotobuf version may or may not conflict.
The latter is kind of an unsolved problem; AFAIK there have been talks to
replace libprotobuf with a third-party alternative to avoid all this
nonsense, but nothing concrete yet.

For Python, the problems are not quite the same:

 - There is no such thing as private linkage in Python, so we *have* to
re-namespace the generated Python files. Note that the output of protoc
assumes that the files are in exactly the same namespace/module
path/whatever as the proto files specify, so the toplevel module would be
"substrait", not "arrow". See
https://github.com/substrait-io/substrait/pull/207 and
https://github.com/substrait-io/substrait-validator/blob/941b81ddf65d8ee6266d6eea0d151a3f9d8a512b/py/build.rs#L105-L142
for my ugly workarounds for this thus far.
 - There is also no such thing as static linkage in Python, so if you'd
want to use Google's protobuf implementation in Python pyarrow *will* need
to depend on it, and it'd become the user's problem to make sure that the
installed Python protobuf version matches the version of their system
library. It looks like pyarrow currently only depends on numpy, which is
pretty awesome... so I feel like we should keep it that way.
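To make the first bullet concrete, the re-namespacing step could be sketched roughly like this. The `pyarrow._substrait` target package is purely hypothetical, and this only handles the `from substrait... import ...` form; real protoc output also contains plain `import substrait.x` lines that need an aliasing rewrite (see the substrait-validator build script linked above for a fuller treatment):

```python
import re

# Hypothetical private package to relocate the generated modules into.
PRIVATE_PACKAGE = "pyarrow._substrait"

def renamespace(source: str) -> str:
    """Rewrite protoc-generated import lines of the form
    'from substrait.<pkg> import <name>_pb2' so they resolve under
    PRIVATE_PACKAGE instead of the top-level 'substrait' package."""
    return re.sub(
        r"^from substrait(\.|\s)",
        rf"from {PRIVATE_PACKAGE}\1",
        source,
        flags=re.MULTILINE,
    )

print(renamespace("from substrait.extensions import extensions_pb2"))
# → from pyarrow._substrait.extensions import extensions_pb2
```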

Not sure what the best course of action is.

Jeroen

On Sun, 3 Jul 2022 at 22:55, Yaron Gvili <rt...@hotmail.com> wrote:

> Thanks, the Google protobuf exposure concerns are clear. Another concern
> would be consistent maintenance of the same version of Substrait protobuf
> used in Arrow C++ and in PyArrow.
>
> I didn't mean access to users but internally. That is, I didn't mean to
> expose Substrait protobuf Python classes to PyArrow users, just to use them
> internally in PyArrow code I'll be developing, per the use case I described
> at the start of this thread. IIUC, this should address the exposure
> concerns, and the `arrow/engine/substrait` module in Arrow C++ does this
> too. If the technical approach I just described would actually expose the
> classes, what would be a proper way to avoid exposing them? Perhaps the
> classes should be generated into a private package, e.g., under
> `python/_ep`? (ep stands for external project)
>
>
> Yaron.
> ________________________________
> From: Antoine Pitrou <an...@python.org>
> Sent: Sunday, July 3, 2022 3:20 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
>
> I agree that giving direct access to protobuf classes is not Arrow's
> job. You can probably take the upstream (i.e. Substrait's) protobuf
> definitions and compile them yourself, using whatever settings required
> by your project.
>
> Regards
>
> Antoine.
>
>
> Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> > It's not so much about whether or not we *can*, but about whether or not
> we
> > *should* generate and expose these files.
> >
> > Fundamentally, Substrait aims to be a means to connect different systems
> > together by standardizing some interchange format. Currently that happens
> > to be based on protobuf, so *one* (obvious) way to generate and interpret
> > Substrait plans is to use Google's own protobuf implementation, as
> > generated by protoc for various languages. It's true that there's nothing
> > fundamentally blocking Arrow from exposing those.
> >
> > However, it's by no means the *only* way to interact with protobuf
> > serializations, and Google's own implementation is a hot mess when it
> comes
> > to dependencies; there are lots of good reasons why you might not want to
> > depend on it and opt for a third-party implementation instead. For one
> > thing, the Python wheel depends on system libraries beyond manylinux, and
> > if they happen to be incompatible (which is likely) it just explodes
> unless
> > you set some environment variable. Therefore, assuming pyarrow doesn't
> > already depend on protobuf, I feel like we should keep it that way, and
> > thus that we should not include the generated Python files in the
> release.
> > Note that we don't even expose/leak the protoc-generated C++ classes for
> > similar reasons.
> >
> > Also, as Weston already pointed out, it's not really our job; Substrait
> > aims to publish bindings for various languages by itself. It just doesn't
> > expose them for Python *yet*. In the interim I suppose you could use the
> > substrait-validator package from PyPI. It does expose them, as well as
> some
> > convenient conversion functions, but I'm having trouble finding people to
> > help me keep the validator maintained.
> >
> > I suppose versioning would be difficult either way, since Substrait
> > semi-regularly pushes breaking changes and Arrow currently lags behind by
> > several months (though I have a PR open for Substrait 0.6). I guess from
> > that point of view distributing the right version along with pyarrow
> seems
> > nice, but the issues of Google's protobuf implementation remain. This
> being
> > an issue at all is also very much a Substrait problem, not an Arrow
> > problem; at best we can try to mitigate it.
> >
> > Jeroen
> >
> > On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> >
> >> I looked into the Arrow build system some more. It is possible to get
> the
> >> Python classes generated by adding the "--python_out" flag (set to a
> directory
> >> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> >> `macro(build_substrait)` in
> `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> >> However, this makes them available only in the Arrow C++ build whereas
> for
> >> the current purpose they need to be available in the PyArrow build. The
> >> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has
> >> access to `cpp/cmake_modules`. So, one solution could be to pull
> >> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> >> generate the Python protobuf classes under `python/`, making them
> available
> >> for import by PyArrow code. This would probably be cleaner with some
> macro
> >> parameters to distinguish between C++ and Python generation.
> >>
> >> Does this sound like a reasonable approach?
> >>
> >>
> >> Yaron.
> >>
> >> ________________________________
> >> From: Yaron Gvili <rt...@hotmail.com>
> >> Sent: Saturday, July 2, 2022 8:55 AM
> >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> >> cpcloud@gmail.com>
> >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >>
> >> I'm somewhat confused by this answer because I think resolving the
> issue I
> >> raised does not require any change outside PyArrow. I'll try to explain
> the
> >> issue differently.
> >>
> >> First, let me describe the current situation with Substrait protobuf in
> >> Arrow C++. The Arrow C++ build system handles external projects in
> >> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external
> >> projects is "substrait". By default, the build system takes the source
> code
> >> for "substrait" from `
> >>
> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> >> `cpp/thirdparty/versions.txt`. The source code is check-summed and
> unpacked
> >> in `substrait_ep-prefix` under the build directory and from this the
> >> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> >> `substrait_ep-generated` under the build directory. The build system
> makes
> >> a library using the `*.cc` files and makes the `*.h` files available for
> >> other C++ modules to use.
> >>
> >> Setting up the above mechanism did not require any change in the
> >> `substrait-io/substrait` repo, nor any coordination with its authors.
> What
> >> I'm looking for is a similar build mechanism for PyArrow that builds
> >> Substrait protobuf Python classes and makes them available for use by
> other
> >> PyArrow modules. I believe this PyArrow build mechanism does not exist
> >> currently and that setting up one would not require any changes outside
> >> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether
> others
> >> agree this mechanism is needed at least due to the problem I ran into
> that
> >> I previously described, and (3) for any thoughts about how to set up
> this
> >> mechanism assuming it is needed.
> >>
> >> Weston, perhaps your thinking was that the Substrait protobuf Python
> >> classes need to be built by a repo in the substrait-io space and made
> >> available as a binary+headers package? This can be done but will require
> >> involving Substrait people and appears to be inconsistent with current
> >> patterns in the Arrow build system. Note that for my purposes here, the
> >> Substrait protobuf Python classes will be used for composing or
> >> interpreting a Substrait plan, not for transforming it by an optimizer,
> >> though a Python-based optimizer is a valid use case for them.
> >>
> >>
> >> Yaron.
> >> ________________________________
> >> From: Weston Pace <we...@gmail.com>
> >> Sent: Friday, July 1, 2022 12:42 PM
> >> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> >> cpcloud@gmail.com>
> >> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
> >>
> >> Given that Acero does not do any planner / optimizer type tasks I'm
> >> not sure you will find anything like this in arrow-cpp or pyarrow.
> >> What you are describing I sometimes refer to as "plan slicing and
> >> dicing".  I have wondered if we will someday need this in Acero but I
> >> fear it is a slippery slope between "a little bit of plan
> >> manipulation" and "a full blown planner" so I've shied away from it.
> >> My first spot to look would be a substrait-python repository which
> >> would belong here: https://github.com/substrait-io
> >>
> >> However, it does not appear that such a repository exists.  If you're
> >> willing to create one then a quick ask on the Substrait Slack instance
> >> should be enough to get the repository created.  Perhaps there is some
> >> genesis of this library in Ibis although I think Ibis would use its
> >> own representation for slicing and dicing and only use Substrait for
> >> serialization.
> >>
> >> Once that repository is created pyarrow could probably import it but
> >> unless this plan manipulation makes sense purely from a pyarrow
> >> perspective I would rather prefer that the user application import
> >> both pyarrow and substrait-python independently.
> >>
> >> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> >> ideas on where this might be found.
> >>
> >> -Weston
> >>
> >> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is there support for accessing Substrait protobuf Python classes (such
> >> as Plan) from PyArrow? If not, how should such support be added? For
> >> example, should the PyArrow build system pull in the Substrait repo as
> an
> >> external project and build its protobuf Python classes, in a manner
> similar
> >> to how Arrow C++ does it?
> >>>
> >>> I'm pondering these questions after running into an issue with code I'm
> >> writing under PyArrow that parses a Substrait plan represented as a
> >> dictionary. The current (and kind of shaky) parsing operation in this
> code
> >> uses json.dumps() on the dictionary, which results in a string that is
> >> passed to a Cython API that handles it using Arrow C++ code that has
> access
> >> to Substrait protobuf C++ classes. But when the Substrait plan contains
> a
> >> bytes-type, json.dumps() no longer works and fails with "TypeError:
> Object
> >> of type bytes is not JSON serializable". A fix for this, and a better
> way
> >> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> >> dictionary. However, this invocation requires a second argument, namely
> a
> >> protobuf message instance to merge with. The class of this message
> (such as
> >> Plan) is a Substrait protobuf Python class, hence the need to access
> such
> >> classes from PyArrow.
> >>>
> >>> [1]
> >>
> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> >>>
> >>>
> >>> Yaron.
> >>
> >
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
Thanks, the Google protobuf exposure concerns are clear. Another concern would be consistent maintenance of the same version of Substrait protobuf used in Arrow C++ and in PyArrow.

I didn't mean access for users but internal use. That is, I didn't mean to expose Substrait protobuf Python classes to PyArrow users, just to use them internally in PyArrow code I'll be developing, per the use case I described at the start of this thread. IIUC, this should address the exposure concerns, and the `arrow/engine/substrait` module in Arrow C++ takes the same approach. If the technical approach I just described would actually expose the classes, what would be a proper way to avoid that? Perhaps the classes should be generated into a private package, e.g., under `python/_ep`? ("ep" stands for external project.)
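The internal use case referred to above boils down to something like the following sketch. The first part reproduces the failure mode with stdlib only; the second shows the intended fix, where the `substrait.plan_pb2` module path and the location of `Plan` are assumptions that depend on how and where the generated classes end up being built:

```python
import json

# A toy Substrait-like plan dict containing a bytes value (e.g. some
# serialized extension blob). json.dumps() cannot serialize bytes:
plan_dict = {"relations": [], "blob": b"\x01\x02"}

try:
    json.dumps(plan_dict)
    json_ok = True
except TypeError:
    # "Object of type bytes is not JSON serializable"
    json_ok = False

# The fix discussed in this thread: merge the dict directly into a
# protobuf message instead of round-tripping through JSON. Imports are
# kept inside the function since neither google.protobuf nor the
# generated Substrait classes may be available; the module path below
# is hypothetical.
def parse_plan(d):
    from google.protobuf.json_format import ParseDict
    from substrait.plan_pb2 import Plan  # assumed generated module
    return ParseDict(d, Plan())
```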


Yaron.
________________________________
From: Antoine Pitrou <an...@python.org>
Sent: Sunday, July 3, 2022 3:20 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow


I agree that giving direct access to protobuf classes is not Arrow's
job. You can probably take the upstream (i.e. Substrait's) protobuf
definitions and compile them yourself, using whatever settings required
by your project.

Regards

Antoine.


Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> It's not so much about whether or not we *can*, but about whether or not we
> *should* generate and expose these files.
>
> Fundamentally, Substrait aims to be a means to connect different systems
> together by standardizing some interchange format. Currently that happens
> to be based on protobuf, so *one* (obvious) way to generate and interpret
> Substrait plans is to use Google's own protobuf implementation, as
> generated by protoc for various languages. It's true that there's nothing
> fundamentally blocking Arrow from exposing those.
>
> However, it's by no means the *only* way to interact with protobuf
> serializations, and Google's own implementation is a hot mess when it comes
> to dependencies; there are lots of good reasons why you might not want to
> depend on it and opt for a third-party implementation instead. For one
> thing, the Python wheel depends on system libraries beyond manylinux, and
> if they happen to be incompatible (which is likely) it just explodes unless
> you set some environment variable. Therefore, assuming pyarrow doesn't
> already depend on protobuf, I feel like we should keep it that way, and
> thus that we should not include the generated Python files in the release.
> Note that we don't even expose/leak the protoc-generated C++ classes for
> similar reasons.
>
> Also, as Weston already pointed out, it's not really our job; Substrait
> aims to publish bindings for various languages by itself. It just doesn't
> expose them for Python *yet*. In the interim I suppose you could use the
> substrait-validator package from PyPI. It does expose them, as well as some
> convenient conversion functions, but I'm having trouble finding people to
> help me keep the validator maintained.
>
> I suppose versioning would be difficult either way, since Substrait
> semi-regularly pushes breaking changes and Arrow currently lags behind by
> several months (though I have a PR open for Substrait 0.6). I guess from
> that point of view distributing the right version along with pyarrow seems
> nice, but the issues of Google's protobuf implementation remain. This being
> an issue at all is also very much a Substrait problem, not an Arrow
> problem; at best we can try to mitigate it.
>
> Jeroen
>
> On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
>
>> I looked into the Arrow build system some more. It is possible to get the
>> Python classes generated by adding the "--python_out" flag (set to a directory
>> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
>> `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
>> However, this makes them available only in the Arrow C++ build whereas for
>> the current purpose they need to be available in the PyArrow build. The
>> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has
>> access to `cpp/cmake_modules`. So, one solution could be to pull
>> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
>> generate the Python protobuf classes under `python/`, making them available
>> for import by PyArrow code. This would probably be cleaner with some macro
>> parameters to distinguish between C++ and Python generation.
>>
>> Does this sound like a reasonable approach?
>>
>>
>> Yaron.
>>
>> ________________________________
>> From: Yaron Gvili <rt...@hotmail.com>
>> Sent: Saturday, July 2, 2022 8:55 AM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>> cpcloud@gmail.com>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> I'm somewhat confused by this answer because I think resolving the issue I
>> raised does not require any change outside PyArrow. I'll try to explain the
>> issue differently.
>>
>> First, let me describe the current situation with Substrait protobuf in
>> Arrow C++. The Arrow C++ build system handles external projects in
>> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external
>> projects is "substrait". By default, the build system takes the source code
>> for "substrait" from `
>> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
>> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
>> `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked
>> in `substrait_ep-prefix` under the build directory and from this the
>> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
>> `substrait_ep-generated` under the build directory. The build system makes
>> a library using the `*.cc` files and makes the `*.h` files available for
>> other C++ modules to use.
>>
>> Setting up the above mechanism did not require any change in the
>> `substrait-io/substrait` repo, nor any coordination with its authors. What
>> I'm looking for is a similar build mechanism for PyArrow that builds
>> Substrait protobuf Python classes and makes them available for use by other
>> PyArrow modules. I believe this PyArrow build mechanism does not exist
>> currently and that setting up one would not require any changes outside
>> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others
>> agree this mechanism is needed at least due to the problem I ran into that
>> I previously described, and (3) for any thoughts about how to set up this
>> mechanism assuming it is needed.
>>
>> Weston, perhaps your thinking was that the Substrait protobuf Python
>> classes need to be built by a repo in the substrait-io space and made
>> available as a binary+headers package? This can be done but will require
>> involving Substrait people and appears to be inconsistent with current
>> patterns in the Arrow build system. Note that for my purposes here, the
>> Substrait protobuf Python classes will be used for composing or
>> interpreting a Substrait plan, not for transforming it by an optimizer,
>> though a Python-based optimizer is a valid use case for them.
>>
>>
>> Yaron.
>> ________________________________
>> From: Weston Pace <we...@gmail.com>
>> Sent: Friday, July 1, 2022 12:42 PM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>> cpcloud@gmail.com>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> Given that Acero does not do any planner / optimizer type tasks I'm
>> not sure you will find anything like this in arrow-cpp or pyarrow.
>> What you are describing I sometimes refer to as "plan slicing and
>> dicing".  I have wondered if we will someday need this in Acero but I
>> fear it is a slippery slope between "a little bit of plan
>> manipulation" and "a full blown planner" so I've shied away from it.
>> My first spot to look would be a substrait-python repository which
>> would belong here: https://github.com/substrait-io
>>
>> However, it does not appear that such a repository exists.  If you're
>> willing to create one then a quick ask on the Substrait Slack instance
>> should be enough to get the repository created.  Perhaps there is some
>> genesis of this library in Ibis although I think Ibis would use its
>> own representation for slicing and dicing and only use Substrait for
>> serialization.
>>
>> Once that repository is created pyarrow could probably import it but
>> unless this plan manipulation makes sense purely from a pyarrow
>> perspective I would rather prefer that the user application import
>> both pyarrow and substrait-python independently.
>>
>> Perhaps @Phillip Cloud or someone from the Ibis space might have some
>> ideas on where this might be found.
>>
>> -Weston
>>
>> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Is there support for accessing Substrait protobuf Python classes (such
>> as Plan) from PyArrow? If not, how should such support be added? For
>> example, should the PyArrow build system pull in the Substrait repo as an
>> external project and build its protobuf Python classes, in a manner similar
>> to how Arrow C++ does it?
>>>
>>> I'm pondering these questions after running into an issue with code I'm
>> writing under PyArrow that parses a Substrait plan represented as a
>> dictionary. The current (and kind of shaky) parsing operation in this code
>> uses json.dumps() on the dictionary, which results in a string that is
>> passed to a Cython API that handles it using Arrow C++ code that has access
>> to Substrait protobuf C++ classes. But when the Substrait plan contains a
>> bytes-type, json.dumps() no longer works and fails with "TypeError: Object
>> of type bytes is not JSON serializable". A fix for this, and a better way
>> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
>> dictionary. However, this invocation requires a second argument, namely a
>> protobuf message instance to merge with. The class of this message (such as
>> Plan) is a Substrait protobuf Python class, hence the need to access such
>> classes from PyArrow.
>>>
>>> [1]
>> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>>>
>>>
>>> Yaron.
>>
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Antoine Pitrou <an...@python.org>.
I agree that giving direct access to protobuf classes is not Arrow's 
job. You can probably take the upstream (i.e. Substrait's) protobuf 
definitions and compile them yourself, using whatever settings required 
by your project.
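Compiling them yourself amounts to something like the sketch below, which only builds the protoc command line; the directory layout is illustrative, and `proto_root` should point at a checkout of substrait-io/substrait's `proto` directory pinned to the same version Arrow C++ uses (see `cpp/thirdparty/versions.txt`):

```python
from pathlib import Path

def protoc_command(proto_root: Path, out_dir: Path) -> list[str]:
    """Build a protoc invocation that generates Python classes for
    every .proto file found under proto_root, writing them to out_dir."""
    protos = sorted(str(p) for p in proto_root.rglob("*.proto"))
    return [
        "protoc",
        f"--proto_path={proto_root}",
        f"--python_out={out_dir}",
        *protos,
    ]

# Usage (assumes protoc is on PATH):
#   import subprocess
#   subprocess.run(protoc_command(Path("substrait/proto"), Path("gen")),
#                  check=True)
```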

Regards

Antoine.


Le 03/07/2022 à 21:16, Jeroen van Straten a écrit :
> It's not so much about whether or not we *can*, but about whether or not we
> *should* generate and expose these files.
> 
> Fundamentally, Substrait aims to be a means to connect different systems
> together by standardizing some interchange format. Currently that happens
> to be based on protobuf, so *one* (obvious) way to generate and interpret
> Substrait plans is to use Google's own protobuf implementation, as
> generated by protoc for various languages. It's true that there's nothing
> fundamentally blocking Arrow from exposing those.
> 
> However, it's by no means the *only* way to interact with protobuf
> serializations, and Google's own implementation is a hot mess when it comes
> to dependencies; there are lots of good reasons why you might not want to
> depend on it and opt for a third-party implementation instead. For one
> thing, the Python wheel depends on system libraries beyond manylinux, and
> if they happen to be incompatible (which is likely) it just explodes unless
> you set some environment variable. Therefore, assuming pyarrow doesn't
> already depend on protobuf, I feel like we should keep it that way, and
> thus that we should not include the generated Python files in the release.
> Note that we don't even expose/leak the protoc-generated C++ classes for
> similar reasons.
> 
> Also, as Weston already pointed out, it's not really our job; Substrait
> aims to publish bindings for various languages by itself. It just doesn't
> expose them for Python *yet*. In the interim I suppose you could use the
> substrait-validator package from PyPI. It does expose them, as well as some
> convenient conversion functions, but I'm having trouble finding people to
> help me keep the validator maintained.
> 
> I suppose versioning would be difficult either way, since Substrait
> semi-regularly pushes breaking changes and Arrow currently lags behind by
> several months (though I have a PR open for Substrait 0.6). I guess from
> that point of view distributing the right version along with pyarrow seems
> nice, but the issues of Google's protobuf implementation remain. This being
> an issue at all is also very much a Substrait problem, not an Arrow
> problem; at best we can try to mitigate it.
> 
> Jeroen
> 
> On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:
> 
>> I looked into the Arrow build system some more. It is possible to get the
>> Python classes generated by adding the "--python_out" flag (set to a directory
>> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
>> `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
>> However, this makes them available only in the Arrow C++ build whereas for
>> the current purpose they need to be available in the PyArrow build. The
>> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has
>> access to `cpp/cmake_modules`. So, one solution could be to pull
>> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
>> generate the Python protobuf classes under `python/`, making them available
>> for import by PyArrow code. This would probably be cleaner with some macro
>> parameters to distinguish between C++ and Python generation.
>>
>> Does this sound like a reasonable approach?
>>
>>
>> Yaron.
>>
>> ________________________________
>> From: Yaron Gvili <rt...@hotmail.com>
>> Sent: Saturday, July 2, 2022 8:55 AM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>> cpcloud@gmail.com>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> I'm somewhat confused by this answer because I think resolving the issue I
>> raised does not require any change outside PyArrow. I'll try to explain the
>> issue differently.
>>
>> First, let me describe the current situation with Substrait protobuf in
>> Arrow C++. The Arrow C++ build system handles external projects in
>> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external
>> projects is "substrait". By default, the build system takes the source code
>> for "substrait" from `
>> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
>> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
>> `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked
>> in `substrait_ep-prefix` under the build directory and from this the
>> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
>> `substrait_ep-generated` under the build directory. The build system makes
>> a library using the `*.cc` files and makes the `*.h` files available for
>> other C++ modules to use.
>>
>> Setting up the above mechanism did not require any change in the
>> `substrait-io/substrait` repo, nor any coordination with its authors. What
>> I'm looking for is a similar build mechanism for PyArrow that builds
>> Substrait protobuf Python classes and makes them available for use by other
>> PyArrow modules. I believe this PyArrow build mechanism does not exist
>> currently and that setting up one would not require any changes outside
>> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others
>> agree this mechanism is needed at least due to the problem I ran into that
>> I previously described, and (3) for any thoughts about how to set up this
>> mechanism assuming it is needed.
>>
>> Weston, perhaps your thinking was that the Substrait protobuf Python
>> classes need to be built by a repo in the substrait-io space and made
>> available as a binary+headers package? This can be done but will require
>> involving Substrait people and appears to be inconsistent with current
>> patterns in the Arrow build system. Note that for my purposes here, the
>> Substrait protobuf Python classes will be used for composing or
>> interpreting a Substrait plan, not for transforming it by an optimizer,
>> though a Python-based optimizer is a valid use case for them.
>>
>>
>> Yaron.
>> ________________________________
>> From: Weston Pace <we...@gmail.com>
>> Sent: Friday, July 1, 2022 12:42 PM
>> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
>> cpcloud@gmail.com>
>> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>>
>> Given that Acero does not do any planner / optimizer type tasks I'm
>> not sure you will find anything like this in arrow-cpp or pyarrow.
>> What you are describing I sometimes refer to as "plan slicing and
>> dicing".  I have wondered if we will someday need this in Acero but I
>> fear it is a slippery slope between "a little bit of plan
>> manipulation" and "a full blown planner" so I've shied away from it.
>> My first spot to look would be a substrait-python repository which
>> would belong here: https://github.com/substrait-io
>>
>> However, it does not appear that such a repository exists.  If you're
>> willing to create one then a quick ask on the Substrait Slack instance
>> should be enough to get the repository created.  Perhaps there is some
>> genesis of this library in Ibis although I think Ibis would use its
>> own representation for slicing and dicing and only use Substrait for
>> serialization.
>>
>> Once that repository is created pyarrow could probably import it but
>> unless this plan manipulation makes sense purely from a pyarrow
>> perspective I would rather prefer that the user application import
>> both pyarrow and substrait-python independently.
>>
>> Perhaps @Phillip Cloud or someone from the Ibis space might have some
>> ideas on where this might be found.
>>
>> -Weston
>>
>> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Is there support for accessing Substrait protobuf Python classes (such
>> as Plan) from PyArrow? If not, how should such support be added? For
>> example, should the PyArrow build system pull in the Substrait repo as an
>> external project and build its protobuf Python classes, in a manner similar
>> to how Arrow C++ does it?
>>>
>>> I'm pondering these questions after running into an issue with code I'm
>> writing under PyArrow that parses a Substrait plan represented as a
>> dictionary. The current (and kind of shaky) parsing operation in this code
>> uses json.dumps() on the dictionary, which results in a string that is
>> passed to a Cython API that handles it using Arrow C++ code that has access
>> to Substrait protobuf C++ classes. But when the Substrait plan contains a
>> bytes-type, json.dumps() no longer works and fails with "TypeError: Object
>> of type bytes is not JSON serializable". A fix for this, and a better way
>> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
>> dictionary. However, this invocation requires a second argument, namely a
>> protobuf message instance to merge with. The class of this message (such as
>> Plan) is a Substrait protobuf Python class, hence the need to access such
>> classes from PyArrow.
>>>
>>> [1]
>> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>>>
>>>
>>> Yaron.
>>
> 

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Jeroen van Straten <je...@gmail.com>.
It's not so much about whether or not we *can*, but about whether or not we
*should* generate and expose these files.

Fundamentally, Substrait aims to be a means to connect different systems
together by standardizing some interchange format. Currently that happens
to be based on protobuf, so *one* (obvious) way to generate and interpret
Substrait plans is to use Google's own protobuf implementation, as
generated by protoc for various languages. It's true that there's nothing
fundamentally blocking Arrow from exposing those.

However, it's by no means the *only* way to interact with protobuf
serializations, and Google's own implementation is a hot mess when it comes
to dependencies; there are lots of good reasons why you might not want to
depend on it and opt for a third-party implementation instead. For one
thing, the Python wheel depends on system libraries beyond manylinux, and
if they happen to be incompatible (which is likely) it just explodes unless
you set some environment variable. Therefore, assuming pyarrow doesn't
already depend on protobuf, I feel like we should keep it that way, and
thus that we should not include the generated Python files in the release.
Note that we don't even expose/leak the protoc-generated C++ classes for
similar reasons.

Also, as Weston already pointed out, it's not really our job; Substrait
aims to publish bindings for various languages by itself. It just doesn't
expose them for Python *yet*. In the interim I suppose you could use the
substrait-validator package from PyPI. It does expose them, as well as some
convenient conversion functions, but I'm having trouble finding people to
help me keep the validator maintained.

I suppose versioning would be difficult either way, since Substrait
semi-regularly pushes breaking changes and Arrow currently lags behind by
several months (though I have a PR open for Substrait 0.6). I guess from
that point of view distributing the right version along with pyarrow seems
nice, but the issues of Google's protobuf implementation remain. This being
an issue at all is also very much a Substrait problem, not an Arrow
problem; at best we can try to mitigate it.

Jeroen

On Sun, Jul 3, 2022, 17:51 Yaron Gvili <rt...@hotmail.com> wrote:

> I looked into the Arrow build system some more. It is possible to get the
> Python classes generated by adding the "--python_out" flag (set to a directory
> created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under
> `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`.
> However, this makes them available only in the Arrow C++ build whereas for
> the current purpose they need to be available in the PyArrow build. The
> PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has
> access to `cpp/cmake_modules`. So, one solution could be to pull
> `macro(build_substrait)` into `python/CMakeLists.txt` and call it to
> generate the Python protobuf classes under `python/`, making them available
> for import by PyArrow code. This would probably be cleaner with some macro
> parameters to distinguish between C++ and Python generation.
>
> Does this sound like a reasonable approach?
>
>
> Yaron.
>
> ________________________________
> From: Yaron Gvili <rt...@hotmail.com>
> Sent: Saturday, July 2, 2022 8:55 AM
> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> cpcloud@gmail.com>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> I'm somewhat confused by this answer because I think resolving the issue I
> raised does not require any change outside PyArrow. I'll try to explain the
> issue differently.
>
> First, let me describe the current situation with Substrait protobuf in
> Arrow C++. The Arrow C++ build system handles external projects in
> `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external
> projects is "substrait". By default, the build system takes the source code
> for "substrait" from `
> https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz
> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in
> `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked
> in `substrait_ep-prefix` under the build directory and from this the
> protobuf C++ classes are generated in `*.pb.{h,cc}` files in
> `substrait_ep-generated` under the build directory. The build system makes
> a library using the `*.cc` files and makes the `*.h` files available for
> other C++ modules to use.
>
> Setting up the above mechanism did not require any change in the
> `substrait-io/substrait` repo, nor any coordination with its authors. What
> I'm looking for is a similar build mechanism for PyArrow that builds
> Substrait protobuf Python classes and makes them available for use by other
> PyArrow modules. I believe this PyArrow build mechanism does not exist
> currently and that setting up one would not require any changes outside
> PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others
> agree this mechanism is needed at least due to the problem I ran into that
> I previously described, and (3) for any thoughts about how to set up this
> mechanism assuming it is needed.
>
> Weston, perhaps your thinking was that the Substrait protobuf Python
> classes need to be built by a repo in the substrait-io space and made
> available as a binary+headers package? This can be done but will require
> involving Substrait people and appears to be inconsistent with current
> patterns in the Arrow build system. Note that for my purposes here, the
> Substrait protobuf Python classes will be used for composing or
> interpreting a Substrait plan, not for transforming it by an optimizer,
> though a Python-based optimizer is a valid use case for them.
>
>
> Yaron.
> ________________________________
> From: Weston Pace <we...@gmail.com>
> Sent: Friday, July 1, 2022 12:42 PM
> To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <
> cpcloud@gmail.com>
> Subject: Re: accessing Substrait protobuf Python classes from PyArrow
>
> Given that Acero does not do any planner / optimizer type tasks I'm
> not sure you will find anything like this in arrow-cpp or pyarrow.
> What you are describing I sometimes refer to as "plan slicing and
> dicing".  I have wondered if we will someday need this in Acero but I
> fear it is a slippery slope between "a little bit of plan
> manipulation" and "a full blown planner" so I've shied away from it.
> My first spot to look would be a substrait-python repository which
> would belong here: https://github.com/substrait-io
>
> However, it does not appear that such a repository exists.  If you're
> willing to create one then a quick ask on the Substrait Slack instance
> should be enough to get the repository created.  Perhaps there is some
> genesis of this library in Ibis although I think Ibis would use its
> own representation for slicing and dicing and only use Substrait for
> serialization.
>
> Once that repository is created pyarrow could probably import it but
> unless this plan manipulation makes sense purely from a pyarrow
> perspective I would rather prefer that the user application import
> both pyarrow and substrait-python independently.
>
> Perhaps @Phillip Cloud or someone from the Ibis space might have some
> ideas on where this might be found.
>
> -Weston
>
> On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > Hi,
> >
> > Is there support for accessing Substrait protobuf Python classes (such
> as Plan) from PyArrow? If not, how should such support be added? For
> example, should the PyArrow build system pull in the Substrait repo as an
> external project and build its protobuf Python classes, in a manner similar
> to how Arrow C++ does it?
> >
> > I'm pondering these questions after running into an issue with code I'm
> writing under PyArrow that parses a Substrait plan represented as a
> dictionary. The current (and kind of shaky) parsing operation in this code
> uses json.dumps() on the dictionary, which results in a string that is
> passed to a Cython API that handles it using Arrow C++ code that has access
> to Substrait protobuf C++ classes. But when the Substrait plan contains a
> bytes-type, json.dumps() no longer works and fails with "TypeError: Object
> of type bytes is not JSON serializable". A fix for this, and a better way
> to parse, is using google.protobuf.json_format.ParseDict() [1] on the
> dictionary. However, this invocation requires a second argument, namely a
> protobuf message instance to merge with. The class of this message (such as
> Plan) is a Substrait protobuf Python class, hence the need to access such
> classes from PyArrow.
> >
> > [1]
> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
> >
> >
> > Yaron.
>

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
I looked into the Arrow build system some more. It is possible to get the Python classes generated by adding the "--python_out" flag (set to a directory created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under `macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`. However, this makes them available only in the Arrow C++ build whereas for the current purpose they need to be available in the PyArrow build. The PyArrow build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has access to `cpp/cmake_modules`. So, one solution could be to pull `macro(build_substrait)` into `python/CMakeLists.txt` and call it to generate the Python protobuf classes under `python/`, making them available for import by PyArrow code. This would probably be cleaner with some macro parameters to distinguish between C++ and Python generation.

Does this sound like a reasonable approach?


Yaron.

________________________________
From: Yaron Gvili <rt...@hotmail.com>
Sent: Saturday, July 2, 2022 8:55 AM
To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <cp...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

I'm somewhat confused by this answer because I think resolving the issue I raised does not require any change outside PyArrow. I'll try to explain the issue differently.

First, let me describe the current situation with Substrait protobuf in Arrow C++. The Arrow C++ build system handles external projects in `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external projects is "substrait". By default, the build system takes the source code for "substrait" from `https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked in `substrait_ep-prefix` under the build directory and from this the protobuf C++ classes are generated in `*.pb.{h,cc}` files in `substrait_ep-generated` under the build directory. The build system makes a library using the `*.cc` files and makes the `*.h` files available for other C++ modules to use.

Setting up the above mechanism did not require any change in the `substrait-io/substrait` repo, nor any coordination with its authors. What I'm looking for is a similar build mechanism for PyArrow that builds Substrait protobuf Python classes and makes them available for use by other PyArrow modules. I believe this PyArrow build mechanism does not exist currently and that setting up one would not require any changes outside PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others agree this mechanism is needed at least due to the problem I ran into that I previously described, and (3) for any thoughts about how to set up this mechanism assuming it is needed.

Weston, perhaps your thinking was that the Substrait protobuf Python classes need to be built by a repo in the substrait-io space and made available as a binary+headers package? This can be done but will require involving Substrait people and appears to be inconsistent with current patterns in the Arrow build system. Note that for my purposes here, the Substrait protobuf Python classes will be used for composing or interpreting a Substrait plan, not for transforming it by an optimizer, though a Python-based optimizer is a valid use case for them.


Yaron.
________________________________
From: Weston Pace <we...@gmail.com>
Sent: Friday, July 1, 2022 12:42 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <cp...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Given that Acero does not do any planner / optimizer type tasks I'm
not sure you will find anything like this in arrow-cpp or pyarrow.
What you are describing I sometimes refer to as "plan slicing and
dicing".  I have wondered if we will someday need this in Acero but I
fear it is a slippery slope between "a little bit of plan
manipulation" and "a full blown planner" so I've shied away from it.
My first spot to look would be a substrait-python repository which
would belong here: https://github.com/substrait-io

However, it does not appear that such a repository exists.  If you're
willing to create one then a quick ask on the Substrait Slack instance
should be enough to get the repository created.  Perhaps there is some
genesis of this library in Ibis although I think Ibis would use its
own representation for slicing and dicing and only use Substrait for
serialization.

Once that repository is created pyarrow could probably import it but
unless this plan manipulation makes sense purely from a pyarrow
perspective I would rather prefer that the user application import
both pyarrow and substrait-python independently.

Perhaps @Phillip Cloud or someone from the Ibis space might have some
ideas on where this might be found.

-Weston

On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi,
>
> Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?
>
> I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-type, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is using google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.
>
> [1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>
>
> Yaron.

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Yaron Gvili <rt...@hotmail.com>.
I'm somewhat confused by this answer because I think resolving the issue I raised does not require any change outside PyArrow. I'll try to explain the issue differently.

First, let me describe the current situation with Substrait protobuf in Arrow C++. The Arrow C++ build system handles external projects in `cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external projects is "substrait". By default, the build system takes the source code for "substrait" from `https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in `cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked in `substrait_ep-prefix` under the build directory and from this the protobuf C++ classes are generated in `*.pb.{h,cc}` files in `substrait_ep-generated` under the build directory. The build system makes a library using the `*.cc` files and makes the `*.h` files available for other C++ modules to use.

Setting up the above mechanism did not require any change in the `substrait-io/substrait` repo, nor any coordination with its authors. What I'm looking for is a similar build mechanism for PyArrow that builds Substrait protobuf Python classes and makes them available for use by other PyArrow modules. I believe this PyArrow build mechanism does not exist currently and that setting up one would not require any changes outside PyArrow. I'm asking (1) whether that's indeed the case, (2) whether others agree this mechanism is needed at least due to the problem I ran into that I previously described, and (3) for any thoughts about how to set up this mechanism assuming it is needed.

Weston, perhaps your thinking was that the Substrait protobuf Python classes need to be built by a repo in the substrait-io space and made available as a binary+headers package? This can be done but will require involving Substrait people and appears to be inconsistent with current patterns in the Arrow build system. Note that for my purposes here, the Substrait protobuf Python classes will be used for composing or interpreting a Substrait plan, not for transforming it by an optimizer, though a Python-based optimizer is a valid use case for them.


Yaron.
________________________________
From: Weston Pace <we...@gmail.com>
Sent: Friday, July 1, 2022 12:42 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>; Phillip Cloud <cp...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Given that Acero does not do any planner / optimizer type tasks I'm
not sure you will find anything like this in arrow-cpp or pyarrow.
What you are describing I sometimes refer to as "plan slicing and
dicing".  I have wondered if we will someday need this in Acero but I
fear it is a slippery slope between "a little bit of plan
manipulation" and "a full blown planner" so I've shied away from it.
My first spot to look would be a substrait-python repository which
would belong here: https://github.com/substrait-io

However, it does not appear that such a repository exists.  If you're
willing to create one then a quick ask on the Substrait Slack instance
should be enough to get the repository created.  Perhaps there is some
genesis of this library in Ibis although I think Ibis would use its
own representation for slicing and dicing and only use Substrait for
serialization.

Once that repository is created pyarrow could probably import it but
unless this plan manipulation makes sense purely from a pyarrow
perspective I would rather prefer that the user application import
both pyarrow and substrait-python independently.

Perhaps @Phillip Cloud or someone from the Ibis space might have some
ideas on where this might be found.

-Weston

On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi,
>
> Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?
>
> I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-type, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is using google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.
>
> [1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>
>
> Yaron.

Re: accessing Substrait protobuf Python classes from PyArrow

Posted by Weston Pace <we...@gmail.com>.
Given that Acero does not do any planner / optimizer type tasks I'm
not sure you will find anything like this in arrow-cpp or pyarrow.
What you are describing I sometimes refer to as "plan slicing and
dicing".  I have wondered if we will someday need this in Acero but I
fear it is a slippery slope between "a little bit of plan
manipulation" and "a full blown planner" so I've shied away from it.
My first spot to look would be a substrait-python repository which
would belong here: https://github.com/substrait-io

However, it does not appear that such a repository exists.  If you're
willing to create one then a quick ask on the Substrait Slack instance
should be enough to get the repository created.  Perhaps there is some
genesis of this library in Ibis although I think Ibis would use its
own representation for slicing and dicing and only use Substrait for
serialization.

Once that repository is created pyarrow could probably import it but
unless this plan manipulation makes sense purely from a pyarrow
perspective I would rather prefer that the user application import
both pyarrow and substrait-python independently.

Perhaps @Phillip Cloud or someone from the Ibis space might have some
ideas on where this might be found.

-Weston

On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi,
>
> Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar to how Arrow C++ does it?
>
> I'm pondering these questions after running into an issue with code I'm writing under PyArrow that parses a Substrait plan represented as a dictionary. The current (and kind of shaky) parsing operation in this code uses json.dumps() on the dictionary, which results in a string that is passed to a Cython API that handles it using Arrow C++ code that has access to Substrait protobuf C++ classes. But when the Substrait plan contains a bytes-type, json.dumps() no longer works and fails with "TypeError: Object of type bytes is not JSON serializable". A fix for this, and a better way to parse, is using google.protobuf.json_format.ParseDict() [1] on the dictionary. However, this invocation requires a second argument, namely a protobuf message instance to merge with. The class of this message (such as Plan) is a Substrait protobuf Python class, hence the need to access such classes from PyArrow.
>
> [1] https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>
>
> Yaron.
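
[Editor's note: the google.protobuf.json_format.ParseDict() pattern discussed in this thread requires a message instance as its second argument. Since the Substrait-generated classes such as Plan are exactly what is unavailable here, this hedged sketch uses google.protobuf's own Struct type as a stand-in message instance; with the generated bindings in place, Struct() would be replaced by Plan(). The dict contents are placeholders, not a valid Substrait plan.]

```python
from google.protobuf.json_format import MessageToDict, ParseDict
from google.protobuf.struct_pb2 import Struct

# Placeholder dict; a real call would pass a Substrait plan dict and a
# generated Plan() instance instead of Struct().
plan_like_dict = {"relations": "placeholder"}

# ParseDict merges the dict into the supplied message instance and returns it.
msg = ParseDict(plan_like_dict, Struct())

# Round-trip back to a dict to show the merge worked.
roundtripped = MessageToDict(msg)
print(roundtripped)
```

Unlike the json.dumps() route, this path goes through protobuf's own JSON mapping, which knows how to handle bytes fields (as base64 strings) in real message types.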