You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2018/04/30 20:08:18 UTC

[Format] Pointer types / span types

Hi,

Today I got the opportunity to talk with Jim Pivarski, the main
developer on the OAMap project (*).  Under the hood, he is doing
something not unlike the Arrow representation of nested arrays: he
stores and processes structured data as linear arrays, allowing very
fast processing on seemingly irregular data (in Array parlance, think
something like lists of lists of structs).  It seems that OAMap data
requires two kinds of logical types that Arrow misses :

- a pointer type, where a physical array of ints is used to represent
indices into another array (the logical value being of course the value
pointed to)
- a span type, where two physical arrays of ints are used to represent
start and stop indices into another array (the logical value being the
list of values delimited by the start / stop indices)

Did such a feature request already come by?  Is this something we should
add to our roadmap or future wishlist?

(*) https://github.com/diana-hep/oamap

Regards

Antoine.

Re: [Format] Pointer types / span types

Posted by Wes McKinney <we...@gmail.com>.

Is there a way to model the span type in a normal Arrow schema? I
would need to look a bit more closely to understand it better

On Mon, Apr 30, 2018 at 5:42 PM, Antoine Pitrou <an...@python.org> wrote:
>
> Le 30/04/2018 à 23:39, Brian Hulette a écrit :
>> Yes my first reaction to both of these requests is
>> - would dictionary-encoding work?
>> - would a List<T> work?
>>
>> I think for the former the analogy is more clear, for the latter,
>> technically a List encodes start and stop indices with an offset array
>> rather than separate arrays for start and stop indices. Is there a
>> reason an offset array wouldn't work for the OAMap use-case though?
>
> With an offsets array, spans (lists) are contiguous: span N + 1 starts
> off where span N stops.  With separate start/stops array, they needn't
> be: the logical array can "walk" the physical array in any order.
>
> Regards
>
> Antoine.

Re: [Format] Pointer types / span types

Posted by Wes McKinney <we...@gmail.com>.

I see what you're saying. I was thinking about the span indices as it
relates to data split across record batches -- if you had a shared
"reference" array it could be treated like a dictionary, so if span
indices split across record batches reference the same array, then it
could be sent in a dictionary batch.

On Wed, May 2, 2018 at 5:03 PM, Brian Hulette <br...@ccri.com> wrote:
> List also references another (data) array which can be a different size, but
> rather than requiring it to be represented with a second schema, we make it
> a child of the List type. We could do the same thing for a Span type, and
> give it a new type of buffer that contains start/stop indices rather than
> offsets.
>
> To Antoine's point, maybe there's not enough demand to justify defining this
> type right now. I definitely agree that it would be good to see an example
> dataset before adding something like this.
>
> Brian
>
>
> On 05/02/2018 03:54 PM, Wes McKinney wrote:
>>>
>>> Perhaps that could be an argument for making span a core logical type?
>>
>> I think if anything, this argues that it should not be. Because "span"
>> references another array, which can be a different size, you need two
>> schemas to be able to make sense of it.
>>
>> In either case, I would be interested to see what modifications would
>> be proposed to Schema.fbs and an example dataset described with such a
>> schema (that is a single array, instead of multiple -- i.e. a
>> non-composite representation).
>>
>> For the record, if there are sufficiently common "composite" data
>> representations, I don't see a problem with developing community
>> standards based on the building blocks we already have
>>
>> - Wes
>>
>> On Wed, May 2, 2018 at 3:38 PM, Brian Hulette <br...@ccri.com>
>> wrote:
>>>
>>> If this were accomplished at the application level, how would it work
>>> with
>>> the IPC formats? I'd think you'd need to have two separate files (or
>>> streams), since array 1 and array 2 will be different lengths. Perhaps
>>> that
>>> could be an argument for making span a core logical type?
>>>
>>> Brian
>>>
>>>
>>>
>>> On 05/02/2018 03:34 PM, Antoine Pitrou wrote:
>>>>
>>>> On Wed, 2 May 2018 10:12:37 -0400
>>>> Wes McKinney <we...@gmail.com> wrote:
>>>>>
>>>>> It sounds like the "span" type could be implemented as a composite of
>>>>> multiple Arrow arrays / schemas:
>>>>>
>>>>> array 1 (data)
>>>>> any schema
>>>>>
>>>>> array 2 (view)
>>>>> struct <
>>>>>     start: int64,
>>>>>     stop: int64
>>>>>>
>>>>>>
>>>>> Unless I'm missing something, this feels like an application-level
>>>>> concern rather than something that needs to be addressed in the
>>>>> columnar format / metadata.
>>>>
>>>> Well, couldn't the same theoretically be said about list arrays?
>>>> In the end, I suppose it all depends whether there's enough demand to
>>>> make it a core logical type inside Arrow, rather than something people
>>>> write custom code for in their application.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>
>>>
>

Re: [Format] Pointer types / span types

Posted by Brian Hulette <br...@ccri.com>.

List also references another (data) array which can be a different size, 
but rather than requiring it to be represented with a second schema, we 
make it a child of the List type. We could do the same thing for a Span 
type, and give it a new type of buffer that contains start/stop indices 
rather than offsets.

To Antoine's point, maybe there's not enough demand to justify defining 
this type right now. I definitely agree that it would be good to see an 
example dataset before adding something like this.

Brian

On 05/02/2018 03:54 PM, Wes McKinney wrote:
>> Perhaps that could be an argument for making span a core logical type?
> I think if anything, this argues that it should not be. Because "span"
> references another array, which can be a different size, you need two
> schemas to be able to make sense of it.
>
> In either case, I would be interested to see what modifications would
> be proposed to Schema.fbs and an example dataset described with such a
> schema (that is a single array, instead of multiple -- i.e. a
> non-composite representation).
>
> For the record, if there are sufficiently common "composite" data
> representations, I don't see a problem with developing community
> standards based on the building blocks we already have
>
> - Wes
>
> On Wed, May 2, 2018 at 3:38 PM, Brian Hulette <br...@ccri.com> wrote:
>> If this were accomplished at the application level, how would it work with
>> the IPC formats? I'd think you'd need to have two separate files (or
>> streams), since array 1 and array 2 will be different lengths. Perhaps that
>> could be an argument for making span a core logical type?
>>
>> Brian
>>
>>
>>
>> On 05/02/2018 03:34 PM, Antoine Pitrou wrote:
>>> On Wed, 2 May 2018 10:12:37 -0400
>>> Wes McKinney <we...@gmail.com> wrote:
>>>> It sounds like the "span" type could be implemented as a composite of
>>>> multiple Arrow arrays / schemas:
>>>>
>>>> array 1 (data)
>>>> any schema
>>>>
>>>> array 2 (view)
>>>> struct <
>>>>     start: int64,
>>>>     stop: int64
>>>>>
>>>> Unless I'm missing something, this feels like an application-level
>>>> concern rather than something that needs to be addressed in the
>>>> columnar format / metadata.
>>> Well, couldn't the same theoretically be said about list arrays?
>>> In the end, I suppose it all depends whether there's enough demand to
>>> make it a core logical type inside Arrow, rather than something people
>>> write custom code for in their application.
>>>
>>> Regards
>>>
>>> Antoine.
>>

Re: [Format] Pointer types / span types

Posted by Wes McKinney <we...@gmail.com>.

> Perhaps that could be an argument for making span a core logical type?

I think if anything, this argues that it should not be. Because "span"
references another array, which can be a different size, you need two
schemas to be able to make sense of it.

In either case, I would be interested to see what modifications would
be proposed to Schema.fbs and an example dataset described with such a
schema (that is a single array, instead of multiple -- i.e. a
non-composite representation).

For the record, if there are sufficiently common "composite" data
representations, I don't see a problem with developing community
standards based on the building blocks we already have

- Wes

On Wed, May 2, 2018 at 3:38 PM, Brian Hulette <br...@ccri.com> wrote:
> If this were accomplished at the application level, how would it work with
> the IPC formats? I'd think you'd need to have two separate files (or
> streams), since array 1 and array 2 will be different lengths. Perhaps that
> could be an argument for making span a core logical type?
>
> Brian
>
>
>
> On 05/02/2018 03:34 PM, Antoine Pitrou wrote:
>>
>> On Wed, 2 May 2018 10:12:37 -0400
>> Wes McKinney <we...@gmail.com> wrote:
>>>
>>> It sounds like the "span" type could be implemented as a composite of
>>> multiple Arrow arrays / schemas:
>>>
>>> array 1 (data)
>>> any schema
>>>
>>> array 2 (view)
>>> struct <
>>>    start: int64,
>>>    stop: int64
>>>>
>>>>
>>>
>>> Unless I'm missing something, this feels like an application-level
>>> concern rather than something that needs to be addressed in the
>>> columnar format / metadata.
>>
>> Well, couldn't the same theoretically be said about list arrays?
>> In the end, I suppose it all depends whether there's enough demand to
>> make it a core logical type inside Arrow, rather than something people
>> write custom code for in their application.
>>
>> Regards
>>
>> Antoine.
>
>

Re: [Format] Pointer types / span types

Posted by Brian Hulette <br...@ccri.com>.

If this were accomplished at the application level, how would it work 
with the IPC formats? I'd think you'd need to have two separate files 
(or streams), since array 1 and array 2 will be different lengths. 
Perhaps that could be an argument for making span a core logical type?

Brian


On 05/02/2018 03:34 PM, Antoine Pitrou wrote:
> On Wed, 2 May 2018 10:12:37 -0400
> Wes McKinney <we...@gmail.com> wrote:
>> It sounds like the "span" type could be implemented as a composite of
>> multiple Arrow arrays / schemas:
>>
>> array 1 (data)
>> any schema
>>
>> array 2 (view)
>> struct <
>>    start: int64,
>>    stop: int64
>>>   
>> Unless I'm missing something, this feels like an application-level
>> concern rather than something that needs to be addressed in the
>> columnar format / metadata.
> Well, couldn't the same theoretically be said about list arrays?
> In the end, I suppose it all depends whether there's enough demand to
> make it a core logical type inside Arrow, rather than something people
> write custom code for in their application.
>
> Regards
>
> Antoine.

Re: [Format] Pointer types / span types

Posted by Antoine Pitrou <so...@pitrou.net>.

On Wed, 2 May 2018 10:12:37 -0400
Wes McKinney <we...@gmail.com> wrote:
> It sounds like the "span" type could be implemented as a composite of
> multiple Arrow arrays / schemas:
> 
> array 1 (data)
> any schema
> 
> array 2 (view)
> struct <
>   start: int64,
>   stop: int64
> >  
> 
> Unless I'm missing something, this feels like an application-level
> concern rather than something that needs to be addressed in the
> columnar format / metadata.

Well, couldn't the same theoretically be said about list arrays?
In the end, I suppose it all depends whether there's enough demand to
make it a core logical type inside Arrow, rather than something people
write custom code for in their application.

Regards

Antoine.

Re: [Format] Pointer types / span types

Posted by Wes McKinney <we...@gmail.com>.

It sounds like the "span" type could be implemented as a composite of
multiple Arrow arrays / schemas:

array 1 (data)
any schema

array 2 (view)
struct <
  start: int64,
  stop: int64
>

Unless I'm missing something, this feels like an application-level
concern rather than something that needs to be addressed in the
columnar format / metadata.

On Tue, May 1, 2018 at 9:43 AM, Antoine Pitrou <an...@python.org> wrote:
>
> IIUC, the point is to have different logical views over the same data.
> So you could have e.g. a "sorted" view.  You could also have a view
> spanning a tiny fraction of the original data (you can probably also
> encode that with a null bitmap, but if most values are nulls that is
> less efficient).
>
> Regards
>
> Antoine.
>
>
> Le 01/05/2018 à 15:24, Brian Hulette a écrit :
>> Yeah I see that difference. I guess my question was really - is there a
>> reason not to re-arrange the actual list data so that an offset array
>> will work?
>>
>> Perhaps they actually want to be able to specify lists with overlap? Or
>> maybe there is meaning to the original order of the list data? I suppose
>> that latter option seems more likely.
>>
>> Brian
>>
>>
>> On 04/30/2018 05:42 PM, Antoine Pitrou wrote:
>>> Le 30/04/2018 à 23:39, Brian Hulette a écrit :
>>>> Yes my first reaction to both of these requests is
>>>> - would dictionary-encoding work?
>>>> - would a List<T> work?
>>>>
>>>> I think for the former the analogy is more clear, for the latter,
>>>> technically a List encodes start and stop indices with an offset array
>>>> rather than separate arrays for start and stop indices. Is there a
>>>> reason an offset array wouldn't work for the OAMap use-case though?
>>> With an offsets array, spans (lists) are contiguous: span N + 1 starts
>>> off where span N stops.  With separate start/stops array, they needn't
>>> be: the logical array can "walk" the physical array in any order.
>>>
>>> Regards
>>>
>>> Antoine.
>>

Re: [Format] Pointer types / span types

Posted by Antoine Pitrou <an...@python.org>.

IIUC, the point is to have different logical views over the same data.
So you could have e.g. a "sorted" view.  You could also have a view
spanning a tiny fraction of the original data (you can probably also
encode that with a null bitmap, but if most values are nulls that is
less efficient).

Regards

Antoine.


Le 01/05/2018 à 15:24, Brian Hulette a écrit :
> Yeah I see that difference. I guess my question was really - is there a 
> reason not to re-arrange the actual list data so that an offset array 
> will work?
> 
> Perhaps they actually want to be able to specify lists with overlap? Or 
> maybe there is meaning to the original order of the list data? I suppose 
> that latter option seems more likely.
> 
> Brian
> 
> 
> On 04/30/2018 05:42 PM, Antoine Pitrou wrote:
>> Le 30/04/2018 à 23:39, Brian Hulette a écrit :
>>> Yes my first reaction to both of these requests is
>>> - would dictionary-encoding work?
>>> - would a List<T> work?
>>>
>>> I think for the former the analogy is more clear, for the latter,
>>> technically a List encodes start and stop indices with an offset array
>>> rather than separate arrays for start and stop indices. Is there a
>>> reason an offset array wouldn't work for the OAMap use-case though?
>> With an offsets array, spans (lists) are contiguous: span N + 1 starts
>> off where span N stops.  With separate start/stops array, they needn't
>> be: the logical array can "walk" the physical array in any order.
>>
>> Regards
>>
>> Antoine.
>

Re: [Format] Pointer types / span types

Posted by Antoine Pitrou <an...@python.org>.

Le 30/04/2018 à 23:39, Brian Hulette a écrit :
> Yes my first reaction to both of these requests is
> - would dictionary-encoding work?
> - would a List<T> work?
> 
> I think for the former the analogy is more clear, for the latter, 
> technically a List encodes start and stop indices with an offset array 
> rather than separate arrays for start and stop indices. Is there a 
> reason an offset array wouldn't work for the OAMap use-case though?

With an offsets array, spans (lists) are contiguous: span N + 1 starts
off where span N stops.  With separate start/stops array, they needn't
be: the logical array can "walk" the physical array in any order.

Regards

Antoine.

Re: [Format] Pointer types / span types

Posted by Brian Hulette <br...@ccri.com>.

Yes my first reaction to both of these requests is
- would dictionary-encoding work?
- would a List<T> work?

I think for the former the analogy is more clear, for the latter, 
technically a List encodes start and stop indices with an offset array 
rather than separate arrays for start and stop indices. Is there a 
reason an offset array wouldn't work for the OAMap use-case though?

Brian


On 04/30/2018 04:55 PM, Antoine Pitrou wrote:
> Actually, "pointer type" might just be another name for "dictionary type".
>
> Regards
>
> Antoine.
>
>
> Le 30/04/2018 à 22:08, Antoine Pitrou a écrit :
>> Hi,
>>
>> Today I got the opportunity to talk with Jim Pivarski, the main
>> developer on the OAMap project (*).  Under the hood, he is doing
>> something not unlike the Arrow representation of nested arrays: he
>> stores and processes structured data as linear arrays, allowing very
>> fast processing on seemingly irregular data (in Array parlance, think
>> something like lists of lists of structs).  It seems that OAMap data
>> requires two kinds of logical types that Arrow misses :
>>
>> - a pointer type, where a physical array of ints is used to represent
>> indices into another array (the logical value being of course the value
>> pointed to)
>> - a span type, where two physical arrays of ints are used to represent
>> start and stop indices into another array (the logical value being the
>> list of values delimited by the start / stop indices)
>>
>> Did such a feature request already come by?  Is this something we should
>> add to our roadmap or future wishlist?
>>
>> (*) https://github.com/diana-hep/oamap
>>
>> Regards
>>
>> Antoine.
>>

Re: [Format] Pointer types / span types

Posted by Antoine Pitrou <an...@python.org>.

Actually, "pointer type" might just be another name for "dictionary type".

Regards

Antoine.


Le 30/04/2018 à 22:08, Antoine Pitrou a écrit :
> 
> Hi,
> 
> Today I got the opportunity to talk with Jim Pivarski, the main
> developer on the OAMap project (*).  Under the hood, he is doing
> something not unlike the Arrow representation of nested arrays: he
> stores and processes structured data as linear arrays, allowing very
> fast processing on seemingly irregular data (in Array parlance, think
> something like lists of lists of structs).  It seems that OAMap data
> requires two kinds of logical types that Arrow misses :
> 
> - a pointer type, where a physical array of ints is used to represent
> indices into another array (the logical value being of course the value
> pointed to)
> - a span type, where two physical arrays of ints are used to represent
> start and stop indices into another array (the logical value being the
> list of values delimited by the start / stop indices)
> 
> Did such a feature request already come by?  Is this something we should
> add to our roadmap or future wishlist?
> 
> (*) https://github.com/diana-hep/oamap
> 
> Regards
> 
> Antoine.
>