Posted to user@arrow.apache.org by Chang She <ch...@eto.ai> on 2022/09/19 01:23:08 UTC

guidance on extension types

Hey y'all, thanks in advance for the discussion.

I'm creating Arrow extensions for computer vision and I'm running into
issues in two scenarios. I couldn't find the answers in the archive so I
thought I'd post here.

Example:
I make an extension type called "Label" that has storage type
"dictionary<int8, string>". This is an object detection dataset, so each row
represents an image and has multiple detected objects that need to be
labeled. So there's a "label" column that is "list<label>":

Example table schema:
image_id: int
uri: string
label: list<label>   # list<dictionary<int8, string>>  storage type


Problems:
1. `to_numpy` does not seem to work with a nested column. e.g., if I try to
call `to_numpy` on the `label` column, then I get "Not implemented type for
Arrow list to pandas: extension<label<LabelType>>"
2. If I'm querying this dataset using duckdb, running "select * from
dataset where label='person'" results in: "Function 'equal' has no kernel
matching input types (extension<label<LabelType>>, string)"

Am I missing an alternate path to make this work with extension types?
Does implementing this in Arrow consist of checking whether something is an
extension type and, if so, using the storage type instead? Is this something
that's already on the roadmap at all?
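
To make the question concrete, here is a toy sketch of that idea in pure Python. It does not use the Arrow API: `ExtensionType`, `KERNELS`, `lower`, and `dispatch` are hypothetical names invented for illustration, and types are written as plain strings. The lookup tries the exact input types first and, if no kernel matches, swaps each extension type for its storage type and tries again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtensionType:
    """Hypothetical stand-in for an extension type wrapping a storage type."""
    name: str
    storage: str  # storage type, written as a plain string in this toy model


# Toy kernel registry: (function name, input types) -> implementation.
# Only storage types are registered, as in the reported error.
KERNELS = {
    ("equal", ("dictionary<int8, string>", "string")): lambda a, b: a == b,
}


def lower(t):
    """Return the storage type for an extension type, else the type itself."""
    return t.storage if isinstance(t, ExtensionType) else t


def dispatch(func, types):
    """Look up a kernel, retrying with storage types if the exact match fails."""
    key = (func, tuple(types))
    if key in KERNELS:
        return KERNELS[key]
    lowered = (func, tuple(lower(t) for t in types))
    if lowered in KERNELS:
        return KERNELS[lowered]
    raise NotImplementedError(
        f"Function '{func}' has no kernel matching input types {tuple(types)}"
    )


label = ExtensionType("label", storage="dictionary<int8, string>")
kernel = dispatch("equal", [label, "string"])
print(kernel("person", "person"))
```

Without the second lookup, the call to `dispatch` would raise, which is the toy equivalent of the "no kernel matching input types" error above.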

Thanks!

Chang She

Re: guidance on extension types

Posted by Chang She <ch...@eto.ai>.
Yup we’ve run into this as well. Though I think you could control this by
implementing a pandas extension dtype to go with the Arrow extension type?


On Wed, Sep 21, 2022 at 9:17 PM Micah Kornfield <em...@gmail.com>
wrote:

> Also, note I've raised a similar issue (
> https://issues.apache.org/jira/browse/ARROW-17535) for to_pandas calls.
> One thing that I think would be nice is to be able to hook into the Python
> conversion to translate to Python objects when necessary.

Re: guidance on extension types

Posted by Micah Kornfield <em...@gmail.com>.
Also, note I've raised a similar issue (
https://issues.apache.org/jira/browse/ARROW-17535) for to_pandas calls.
One thing that I think would be nice is to be able to hook into the Python
conversion to translate to Python objects when necessary.



On Wed, Sep 21, 2022 at 8:49 PM Chang She <ch...@eto.ai> wrote:

> Thanks Wes.
>
> => Array.to_numpy : I opened ARROW-17813
> <https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
> added some details / repro code. There's also a follow-up thing about the
> other direction, converting from a pandas DataFrame column to an Arrow
> list<extension>.
>
> => You're right, I was a little hasty in the description and it wasn't
> very accurate:
>
> Scenario 1:
>
> If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
> `pc.field("extension") == 'string'` would be a valid filter but
> currently triggers the "function 'equal' has no kernel matching input
> types" error.
> This is the path used by DuckDB if you add something like
> `extension=='string'` in the where clause.
> If Arrow/Acero is also able to automatically lower to storage type for the
> functions then it would make running compute on extension types a lot
> easier. Even for a list<label> column, at least in duckdb you could use
> "UNNEST" to make it work.
>
>
> Scenario 2:
>
> The trouble with using UNNEST is it makes the query a lot more complicated
> and has perf implications. If we're working a lot with nested data types,
> it would be easier to have a set of array functions.
> If there's a nested ExtensionArray, then something like a list-contains
> function would make things a lot easier. However, I think this is a lot
> more work (and depends on other systems like duckdb to integrate with these
> functions as well).
>
>
> Would it make sense for me to create a JIRA for scenario 1 to continue
> further discussion?
>
>
> Thanks again.

Re: guidance on extension types

Posted by Chang She <ch...@eto.ai>.
Thanks Wes.

=> Array.to_numpy : I opened ARROW-17813
<https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
added some details / repro code. There's also a follow-up thing about the
other direction, converting from a pandas DataFrame column to an Arrow
list<extension>.

=> You're right, I was a little hasty in the description and it wasn't very
accurate:

Scenario 1:

If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
`pc.field("extension") == 'string'` would be a valid filter but
currently triggers the "function 'equal' has no kernel matching input
types" error.
This is the path used by DuckDB if you add something like
`extension=='string'` in the where clause.
If Arrow/Acero is also able to automatically lower to storage type for the
functions then it would make running compute on extension types a lot
easier. Even for a list<label> column, at least in duckdb you could use
"UNNEST" to make it work.


Scenario 2:

The trouble with using UNNEST is it makes the query a lot more complicated
and has perf implications. If we're working a lot with nested data types,
it would be easier to have a set of array functions.
If there's a nested ExtensionArray, then something like a list-contains
function would make things a lot easier. However, I think this is a lot
more work (and depends on other systems like duckdb to integrate with these
functions as well).
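
As a rough illustration of the difference between the two scenarios, here is a toy pure-Python sketch. It assumes the `label` column is held as list offsets plus dictionary-encoded values; `list_contains` and `unnest` are hypothetical helper names for this sketch, not functions from Arrow or DuckDB.

```python
# Toy layout for a list<dictionary<int8, string>> column with three rows.
dictionary = ["person", "car", "dog"]   # distinct label strings
indices = [0, 1, 0, 2, 1]               # dictionary index of each detected object
offsets = [0, 2, 3, 5]                  # row i owns indices[offsets[i]:offsets[i + 1]]


def list_contains(offsets, indices, dictionary, value):
    """Return one boolean per row: does the row's list contain `value`?"""
    if value not in dictionary:
        return [False] * (len(offsets) - 1)
    code = dictionary.index(value)  # compare integer codes, not strings
    return [
        code in indices[start:stop]
        for start, stop in zip(offsets, offsets[1:])
    ]


def unnest(offsets, indices, dictionary):
    """Flatten the column to (row number, label) pairs."""
    return [
        (row, dictionary[i])
        for row, (start, stop) in enumerate(zip(offsets, offsets[1:]))
        for i in indices[start:stop]
    ]


# Array-function style: one mask, aligned with the original rows.
mask = list_contains(offsets, indices, dictionary, "person")

# UNNEST style: flatten, filter, then deduplicate back to row numbers.
flat = unnest(offsets, indices, dictionary)
rows_via_unnest = sorted({row for row, name in flat if name == "person"})

print(mask, rows_via_unnest)
```

Both paths select the same rows, but the list-contains version never materializes the flattened pairs and keeps the result aligned with the input.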


Would it make sense for me to create a JIRA for scenario 1 to continue
further discussion?


Thanks again.


On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <we...@gmail.com> wrote:

> hi Chang,
>
> There are a few rough edges here that you've run into:
>
> * It looks like Array.to_numpy does not "automatically lower" to the
> storage type when trying to convert to NumPy format. In the absence of
> some other conversion rule, converting to the storage type seems like
> a reasonable alternative to failing. Can you open a Jira issue about
> this? This could probably be fixed easily in time for the 10.0.0
> release, much more easily than the next issue.
>
> * On the query, it looks like the filter portion at least is being
> handled by Arrow/Acero. The syntax / UX for nested types here is
> relatively unexplored compared to non-nested types. Comparing the
> label type (itself a list of dictionary-encoded strings) to a string
> seems invalid here; you would probably need to check for inclusion
> of the string in the label list-of-strings. I do not know what the
> syntax for this would be with DuckDB (to check for inclusion of a
> string in a list of strings), but in principle this is something that
> should be able to be made to work with some effort.
>
> - Wes

Re: guidance on extension types

Posted by Wes McKinney <we...@gmail.com>.
hi Chang,

There are a few rough edges here that you've run into:

* It looks like Array.to_numpy does not "automatically lower" to the
storage type when trying to convert to NumPy format. In the absence of
some other conversion rule, converting to the storage type seems like
a reasonable alternative to failing. Can you open a Jira issue about
this? This could probably be fixed easily in time for the 10.0.0
release, much more easily than the next issue.

* On the query, it looks like the filter portion at least is being
handled by Arrow/Acero. The syntax / UX for nested types here is
relatively unexplored compared to non-nested types. Comparing the
label type (itself a list of dictionary-encoded strings) to a string
seems invalid here; you would probably need to check for inclusion
of the string in the label list-of-strings. I do not know what the
syntax for this would be with DuckDB (to check for inclusion of a
string in a list of strings), but in principle this is something that
should be able to be made to work with some effort.
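
The "automatically lower" idea in the first point can be sketched as a toy model in pure Python. This is not how pyarrow implements conversion: `Ext`, `ListOf`, `lower_type`, and `convert` are hypothetical names, and primitive types are plain strings. The point is only that lowering has to recurse into nested types, so that a list of extension values becomes a list of storage values before conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ext:
    """Hypothetical extension type: a name plus the storage type it wraps."""
    name: str
    storage: object


@dataclass(frozen=True)
class ListOf:
    """Hypothetical list type with a single value type."""
    value_type: object


def lower_type(t):
    """Recursively replace extension types with their storage types."""
    if isinstance(t, Ext):
        return lower_type(t.storage)
    if isinstance(t, ListOf):
        return ListOf(lower_type(t.value_type))
    return t  # primitive types are plain strings in this sketch


def convert(t, value):
    """Convert a value to plain Python objects; knows nothing about extensions."""
    if isinstance(t, ListOf):
        return [convert(t.value_type, v) for v in value]
    if isinstance(t, Ext):
        raise NotImplementedError(f"Not implemented type: extension<{t.name}>")
    return value


label = Ext("label", storage="dictionary<int8, string>")
column_type = ListOf(label)
row = ["person", "car"]

# Converting with the original type fails; lowering first succeeds.
print(lower_type(column_type))
print(convert(lower_type(column_type), row))
```

Calling `convert(column_type, row)` directly raises, which mirrors the nested `to_numpy` failure from the original report.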

- Wes

On Sun, Sep 18, 2022 at 8:23 PM Chang She <ch...@eto.ai> wrote:
>
> Hey y'all, thanks in advance for the discussion.
>
> I'm creating Arrow extensions for computer vision and I'm running into issues in two scenarios. I couldn't find the answers in the archive so I thought I'd post here.
>
> Example:
> I make an extension type called "Label" that has storage type "dictionary<int8, string>". This is an object detection dataset, so each row represents an image and has multiple detected objects that need to be labeled. So there's a "label" column that is "list<label>":
>
> Example table schema:
> image_id: int
> uri: string
> label: list<label>   # list<dictionary<int8, string>>  storage type
>
>
> Problems:
> 1. `to_numpy` does not seem to work with a nested column. e.g., if I try to call `to_numpy` on the `label` column, then I get "Not implemented type for Arrow list to pandas: extension<label<LabelType>>"
> 2. If I'm querying this dataset using duckdb, running "select * from dataset where label='person'" results in: "Function 'equal' has no kernel matching input types (extension<label<LabelType>>, string)"
>
> Am I missing an alternate path to make this work with extension types?
> Does implementing this in Arrow consist of checking whether something is an extension type and, if so, using the storage type instead? Is this something that's already on the roadmap at all?
>
> Thanks!
>
> Chang She