Posted to dev@arrow.apache.org by Justin Polchlopek <jp...@azavea.com> on 2019/11/06 19:28:55 UTC

Re: Achieving parity with Java extension types in Python

Hi.  I'm looking into this issue and I have some questions as someone new
to the project.  The comment from Joris earlier in the thread suggests that
the solution here is to create an Array subclass for each extension type
that wants to use one.  This will give a nice symmetry w.r.t. the Java
interface, but in the Python case, this seems to suggest having to travel
some fairly byzantine code paths (rather quickly, we end up in C++ code,
where I lose the thread of what's happening—specifically as regards
`pyarrow_wrap_array`, as suggested in ARROW-6176).

I came up with a quick-and-dirty method wherein the ExtensionType subclass
simply provides a method to translate from the storage type to the output
type, and ExtensionArray has a __getitem__ implementation that passes the
element from storage through the translation function.  This doesn't feel
outside the realm of what is often acceptable in the Python world, but
it isn't nearly as strongly typed as Arrow seems to be leaning.  Plus, this
feels very far from what was intended in the issue, and I believe that I'm
not understanding the underlying design principles.
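
As a rough illustration of the quick-and-dirty approach described above,
here is a minimal pure-Python sketch.  It deliberately does not use pyarrow:
ToyExtensionType, ToyExtensionArray, storage_to_output, and UuidType are
hypothetical stand-ins invented for this example, not part of any Arrow API.

```python
import uuid


class ToyExtensionType:
    """Stand-in for an extension type (hypothetical, not pyarrow.ExtensionType)."""

    def storage_to_output(self, value):
        # Default translation: hand back the storage value unchanged.
        return value


class UuidType(ToyExtensionType):
    """Example type whose storage is 16 raw bytes per element."""

    def storage_to_output(self, value):
        # Translate one storage element into the user-facing object.
        return None if value is None else uuid.UUID(bytes=value)


class ToyExtensionArray:
    """Stand-in for an extension array wrapping a storage sequence."""

    def __init__(self, ext_type, storage):
        self.type = ext_type
        self.storage = storage

    def __len__(self):
        return len(self.storage)

    def __getitem__(self, index):
        # Integer indexing only: fetch from storage, then pass the element
        # through the type's translation function.
        return self.type.storage_to_output(self.storage[index])


raw = [uuid.UUID(int=1).bytes, None, uuid.UUID(int=2).bytes]
arr = ToyExtensionArray(UuidType(), raw)
```

In this sketch the array class stays generic and all type-specific behavior
lives in one method on the type, which is what makes it quick but also less
strongly typed than a dedicated array subclass.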

Can I get a bit of advice on this?

Thanks.
-J

On Tue, Oct 29, 2019 at 12:26 PM Justin Polchlopek <jp...@azavea.com>
wrote:

> That sounds about right.  We're doing some work here that might require
> this feature sooner rather than later, and if we decide to go the route
> that needs this improved support, I'd be happy to make this PR.  Thanks
> for showing that issue.  I'll be sure to tag any contribution with that
> ticket number.
>
> On Tue, Oct 29, 2019 at 9:01 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
>
>>
>> On Mon, 28 Oct 2019 at 22:41, Wes McKinney <we...@gmail.com> wrote:
>>
>>> Adding dev@
>>>
>>> I don't believe we have APIs yet for plugging in user-defined Array
>>> subtypes. I assume you've read
>>>
>>>
>>> http://arrow.apache.org/docs/python/extending_types.html#defining-extension-types-user-defined-types
>>>
>>> There may be some JIRA issues already about this (defining subclasses
>>> of pa.Array with custom behavior) -- since Joris has been working on
>>> this I'm interested in more comments
>>>
>>
>> Yes, there is https://issues.apache.org/jira/browse/ARROW-6176 for
>> exactly this issue.
>> What I proposed there is to allow one to subclass pyarrow.ExtensionArray
>> and to attach this to an attribute on the custom ExtensionType (e.g.
>> __arrow_ext_array_class__, in line with the other __arrow_ext_..
>> methods). That should make it possible to achieve functionality similar
>> to what is available in Java, I think.
>>
>> If that seems a good way to do this, I think we certainly welcome a PR
>> for that (I can also look into it otherwise before 1.0).
>>
>> Joris
>>
>>
>>>
>>> On Mon, Oct 28, 2019 at 3:56 PM Justin Polchlopek
>>> <jp...@azavea.com> wrote:
>>> >
>>> > Hi!
>>> >
>>> > I've been working through understanding extension types in Arrow.
>>> > It's a great feature, and I've had no problems getting things working
>>> > in Java/Scala; however, Python has been a bit of a different story.
>>> > Not that I am unable to create and register extension types in Python,
>>> > but rather that I can't seem to recreate the functionality provided by
>>> > the Java API's ExtensionTypeVector class.
>>> >
>>> > In Java, ExtensionType::getNewVector() provides a clear pathway from
>>> > the registered type to output a vector in something other than the
>>> > underlying vector type, and I am at a loss for how to get this same
>>> > functionality in Python.  Am I missing something?
>>> >
>>> > Thanks for any hints.
>>> > -Justin
>>>
>>
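
For comparison, the approach proposed in ARROW-6176 and quoted above (an
array subclass attached to the extension type) might look roughly like the
following.  This is again a pure-Python stand-in rather than pyarrow code:
ToyArray, ToyType, PointArray, PointType, and wrap_array are hypothetical,
and __arrow_ext_array_class__ is only the attribute name floated in the
thread, not a confirmed API.

```python
class ToyArray:
    """Generic stand-in array (hypothetical, not pyarrow.ExtensionArray)."""

    def __init__(self, ext_type, storage):
        self.type = ext_type
        self.storage = storage

    def __getitem__(self, index):
        return self.storage[index]


class ToyType:
    """Stand-in extension type; by default it maps to the generic array."""

    # Attribute name as floated in the thread (an assumption, not a real API).
    __arrow_ext_array_class__ = ToyArray


class PointArray(ToyArray):
    """Array subclass carrying behavior specific to one extension type."""

    def __getitem__(self, index):
        x, y = self.storage[index]
        return {"x": x, "y": y}

    def xs(self):
        # Type-specific convenience method that the generic array lacks.
        return [x for x, _ in self.storage]


class PointType(ToyType):
    __arrow_ext_array_class__ = PointArray


def wrap_array(ext_type, storage):
    # Look up the array class on the type and instantiate it, loosely
    # mirroring the role the thread attributes to pyarrow_wrap_array.
    cls = getattr(ext_type, "__arrow_ext_array_class__", ToyArray)
    return cls(ext_type, storage)


points = wrap_array(PointType(), [(1, 2), (3, 4)])
plain = wrap_array(ToyType(), [(1, 2)])
```

The design difference from a per-element translation function is that the
type selects a whole array class, so custom methods as well as custom element
access can live on the array itself.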

Re: Achieving parity with Java extension types in Python

Posted by Justin Polchlopek <jp...@azavea.com>.
I made a PR for this issue at https://github.com/apache/arrow/pull/5835.
I would love more detail about what the initial issue intended and what a
better approach would be.

On Tue, Nov 12, 2019 at 11:25 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Sorry for the delay in response. I would suggest that you open a PR (or
> point to a branch with those changes), that will make it easier to discuss
> specific implementation options (rather than trying to explain and
> understand it in words) and give advice.
>
> On Wed, 6 Nov 2019 at 20:29, Justin Polchlopek <jp...@azavea.com>
> wrote:
>
> > Hi.  I'm looking into this issue and I have some questions as someone new
> > to the project.  The comment from Joris earlier in the thread suggests that
> > the solution here is to create an Array subclass for each extension type
> > that wants to use one.  This will give a nice symmetry w.r.t. the Java
> > interface, but in the Python case, this seems to suggest having to travel
> > some fairly byzantine code paths (rather quickly, we end up in C++ code,
> > where I lose the thread of what's happening—specifically as regards
> > `pyarrow_wrap_array`, as suggested in ARROW-6176).
> >
>
> The goal here is that for the end user, it is possible to do this without
> involving C++ code, and I *think* implementing it should be possible from
> cython. How did you end up in C++?
>
>
> > I came up with a quick-and-dirty method wherein the ExtensionType subclass
> > simply provides a method to translate from the storage type to the output
> > type, and ExtensionArray has a __getitem__ implementation that passes the
> > element from storage through the translation function.  This doesn't feel
> > outside of the realm of what is often acceptable in the python world, but
> > it isn't nearly as typeful as Arrow seems to be leaning.  Plus, this feels
> > very far from what was intended in the issue, and I believe that I'm not
> > understanding the underlying design principles.
> >
> > Can I get a bit of advice on this?
> >
> > Thanks.
> > -J

Re: Achieving parity with Java extension types in Python

Posted by Joris Van den Bossche <jo...@gmail.com>.
Sorry for the delay in response. I would suggest that you open a PR (or
point to a branch with those changes); that will make it easier to discuss
specific implementation options (rather than trying to explain and
understand it in words) and to give advice.

On Wed, 6 Nov 2019 at 20:29, Justin Polchlopek <jp...@azavea.com>
wrote:

> Hi.  I'm looking into this issue and I have some questions as someone new
> to the project.  The comment from Joris earlier in the thread suggests that
> the solution here is to create an Array subclass for each extension type
> that wants to use one.  This will give a nice symmetry w.r.t. the Java
> interface, but in the Python case, this seems to suggest having to travel
> some fairly byzantine code paths (rather quickly, we end up in C++ code,
> where I lose the thread of what's happening—specifically as regards
> `pyarrow_wrap_array`, as suggested in ARROW-6176).
>

The goal here is that the end user can do this without involving C++ code,
and I *think* it should be possible to implement this from Cython. How did
you end up in C++?


> I came up with a quick-and-dirty method wherein the ExtensionType subclass
> simply provides a method to translate from the storage type to the output
> type, and ExtensionArray has a __getitem__ implementation that passes the
> element from storage through the translation function.  This doesn't feel
> outside of the realm of what is often acceptable in the python world, but
> it isn't nearly as typeful as Arrow seems to be leaning.  Plus, this feels
> very far from what was intended in the issue, and I believe that I'm not
> understanding the underlying design principles.
>
> Can I get a bit of advice on this?
>
> Thanks.
> -J