You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2017/07/18 19:10:17 UTC

Adding a Map logical type to the Arrow metadata

I recently created https://issues.apache.org/jira/browse/ARROW-1207
and wanted to discuss on the mailing list to hear opinions about how
to proceed.

Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
composite type. These are sometimes stored in Parquet as a repeated
struct, or in Arrow types List<item: Struct<key: K, value: V>>.

While we can represent in-memory map data as List<Struct<K, V>>, it
may be useful to add a new logical type to the set of supported
logical types [3]. The idea is that the memory format between Map<K,
V> and List<Struct<K, V>> is identical, so this is strictly a logical
construct, similar to date/time values having the same in-memory
format as the corresponding integer types (int32/int64)

For Arrow implementation that do not provide a first class Map
container, they could process the data as though it were a repeated
struct. It would be helpful to us in C++ to have an arrow::MapArray
container because we could convert to / from this type and other data
structures like Python dictionaries. It would also be helpful to
faithfully transport the MAP logical type from Parquet [4]

Let me know what others think. One question I have is whether the
repeated struct in-memory representation makes sense as the canonical
map representation.

Thanks
Wes

[1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
[2]: https://prestodb.io/docs/current/functions/map.html
[3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
[4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63

Re: Adding a Map logical type to the Arrow metadata

Posted by Wes McKinney <we...@gmail.com>.

I created a PR to assist with further discussion:
https://github.com/apache/arrow/pull/876

On Wed, Jul 19, 2017 at 4:14 PM, Julian Hyde <jh...@apache.org> wrote:
> I see. It took me a while to understand, but it all made sense when I
> realized that we are not looking at one Map instance but multiple
> rows, each with a Map instance, and the constituents parts of those
> Maps are stored end-to-end.
>
> Julian
>
>
> On Wed, Jul 19, 2017 at 11:42 AM, Wes McKinney <we...@gmail.com> wrote:
>> The only structural difference between
>>
>> List<Struct<K, V>>
>>
>> and
>>
>> Struct<List<K>, List<V>>
>>
>> is that in the latter case, the "key" value and the "value" value have
>> different offset vectors and thus can have different lengths.
>>
>> So in the first case we have buffer structure:
>>
>> - list null bitmap (map value is null / not null)
>> - list offsets
>> - key buffers (flattened)
>> - value buffers
>>
>> in the second case we have 1 additional buffer
>>
>> - struct null bitmap (map value is null / not null)
>> - key offsets
>> - key buffers
>> - value offsets
>> - value buffers
>>
>> We can do a zero-copy transformation from the former to the latter by
>> rearranging the buffers and reusing the list offsets buffer. In both
>> cases the keys and values are all contiguous.
>>
>> I agree that sorting is very useful; the metadata for Map should have
>> a field indicating whether or not the keys are sorted within each map
>> value
>>
>> - Wes
>>
>> On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote:
>>> List<Struct<K, V>> isn’t the only physical representation that makes sense. Because it doesn’t take advantage of the fact that (a) keys can be re-ordered, (b) keys are unique.
>>>
>>> So, another viable physical representation would be Struct<List<K>, List<V>>, with the keys sorted. If keys are constant width and in contiguous memory then binary search is very fast.
>>>
>>> I am not claiming that this physical representation is better than yours. But the fact that there is a more than one means it’s not a no-brainer.
>>>
>>> Julian
>>>
>>>
>>>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <we...@gmail.com> wrote:
>>>>
>>>> I recently created https://issues.apache.org/jira/browse/ARROW-1207
>>>> and wanted to discuss on the mailing list to hear opinions about how
>>>> to proceed.
>>>>
>>>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
>>>> composite type. These are sometimes stored in Parquet as a repeated
>>>> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
>>>>
>>>> While we can represent in-memory map data as List<Struct<K, V>>, it
>>>> may be useful to add a new logical type to the set of supported
>>>> logical types [3]. The idea is that the memory format between Map<K,
>>>> V> and List<Struct<K, V>> is identical, so this is strictly a logical
>>>> construct, similar to date/time values having the same in-memory
>>>> format as the corresponding integer types (int32/int64)
>>>>
>>>> For Arrow implementation that do not provide a first class Map
>>>> container, they could process the data as though it were a repeated
>>>> struct. It would be helpful to us in C++ to have an arrow::MapArray
>>>> container because we could convert to / from this type and other data
>>>> structures like Python dictionaries. It would also be helpful to
>>>> faithfully transport the MAP logical type from Parquet [4]
>>>>
>>>> Let me know what others think. One question I have is whether the
>>>> repeated struct in-memory representation makes sense as the canonical
>>>> map representation.
>>>>
>>>> Thanks
>>>> Wes
>>>>
>>>> [1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
>>>> [2]: https://prestodb.io/docs/current/functions/map.html
>>>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
>>>> [4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
>>>

Re: Adding a Map logical type to the Arrow metadata

Posted by Julian Hyde <jh...@apache.org>.

I see. It took me a while to understand, but it all made sense when I
realized that we are not looking at one Map instance but multiple
rows, each with a Map instance, and the constituents parts of those
Maps are stored end-to-end.

Julian


On Wed, Jul 19, 2017 at 11:42 AM, Wes McKinney <we...@gmail.com> wrote:
> The only structural difference between
>
> List<Struct<K, V>>
>
> and
>
> Struct<List<K>, List<V>>
>
> is that in the latter case, the "key" value and the "value" value have
> different offset vectors and thus can have different lengths.
>
> So in the first case we have buffer structure:
>
> - list null bitmap (map value is null / not null)
> - list offsets
> - key buffers (flattened)
> - value buffers
>
> in the second case we have 1 additional buffer
>
> - struct null bitmap (map value is null / not null)
> - key offsets
> - key buffers
> - value offsets
> - value buffers
>
> We can do a zero-copy transformation from the former to the latter by
> rearranging the buffers and reusing the list offsets buffer. In both
> cases the keys and values are all contiguous.
>
> I agree that sorting is very useful; the metadata for Map should have
> a field indicating whether or not the keys are sorted within each map
> value
>
> - Wes
>
> On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote:
>> List<Struct<K, V>> isn’t the only physical representation that makes sense. Because it doesn’t take advantage of the fact that (a) keys can be re-ordered, (b) keys are unique.
>>
>> So, another viable physical representation would be Struct<List<K>, List<V>>, with the keys sorted. If keys are constant width and in contiguous memory then binary search is very fast.
>>
>> I am not claiming that this physical representation is better than yours. But the fact that there is a more than one means it’s not a no-brainer.
>>
>> Julian
>>
>>
>>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <we...@gmail.com> wrote:
>>>
>>> I recently created https://issues.apache.org/jira/browse/ARROW-1207
>>> and wanted to discuss on the mailing list to hear opinions about how
>>> to proceed.
>>>
>>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
>>> composite type. These are sometimes stored in Parquet as a repeated
>>> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
>>>
>>> While we can represent in-memory map data as List<Struct<K, V>>, it
>>> may be useful to add a new logical type to the set of supported
>>> logical types [3]. The idea is that the memory format between Map<K,
>>> V> and List<Struct<K, V>> is identical, so this is strictly a logical
>>> construct, similar to date/time values having the same in-memory
>>> format as the corresponding integer types (int32/int64)
>>>
>>> For Arrow implementation that do not provide a first class Map
>>> container, they could process the data as though it were a repeated
>>> struct. It would be helpful to us in C++ to have an arrow::MapArray
>>> container because we could convert to / from this type and other data
>>> structures like Python dictionaries. It would also be helpful to
>>> faithfully transport the MAP logical type from Parquet [4]
>>>
>>> Let me know what others think. One question I have is whether the
>>> repeated struct in-memory representation makes sense as the canonical
>>> map representation.
>>>
>>> Thanks
>>> Wes
>>>
>>> [1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
>>> [2]: https://prestodb.io/docs/current/functions/map.html
>>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
>>> [4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
>>

Re: Adding a Map logical type to the Arrow metadata

Posted by Wes McKinney <we...@gmail.com>.

The only structural difference between

List<Struct<K, V>>

and

Struct<List<K>, List<V>>

is that in the latter case, the "key" value and the "value" value have
different offset vectors and thus can have different lengths.

So in the first case we have buffer structure:

- list null bitmap (map value is null / not null)
- list offsets
- key buffers (flattened)
- value buffers

in the second case we have 1 additional buffer

- struct null bitmap (map value is null / not null)
- key offsets
- key buffers
- value offsets
- value buffers

We can do a zero-copy transformation from the former to the latter by
rearranging the buffers and reusing the list offsets buffer. In both
cases the keys and values are all contiguous.

I agree that sorting is very useful; the metadata for Map should have
a field indicating whether or not the keys are sorted within each map
value

- Wes

On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote:
> List<Struct<K, V>> isn’t the only physical representation that makes sense. Because it doesn’t take advantage of the fact that (a) keys can be re-ordered, (b) keys are unique.
>
> So, another viable physical representation would be Struct<List<K>, List<V>>, with the keys sorted. If keys are constant width and in contiguous memory then binary search is very fast.
>
> I am not claiming that this physical representation is better than yours. But the fact that there is a more than one means it’s not a no-brainer.
>
> Julian
>
>
>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>> I recently created https://issues.apache.org/jira/browse/ARROW-1207
>> and wanted to discuss on the mailing list to hear opinions about how
>> to proceed.
>>
>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
>> composite type. These are sometimes stored in Parquet as a repeated
>> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
>>
>> While we can represent in-memory map data as List<Struct<K, V>>, it
>> may be useful to add a new logical type to the set of supported
>> logical types [3]. The idea is that the memory format between Map<K,
>> V> and List<Struct<K, V>> is identical, so this is strictly a logical
>> construct, similar to date/time values having the same in-memory
>> format as the corresponding integer types (int32/int64)
>>
>> For Arrow implementation that do not provide a first class Map
>> container, they could process the data as though it were a repeated
>> struct. It would be helpful to us in C++ to have an arrow::MapArray
>> container because we could convert to / from this type and other data
>> structures like Python dictionaries. It would also be helpful to
>> faithfully transport the MAP logical type from Parquet [4]
>>
>> Let me know what others think. One question I have is whether the
>> repeated struct in-memory representation makes sense as the canonical
>> map representation.
>>
>> Thanks
>> Wes
>>
>> [1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
>> [2]: https://prestodb.io/docs/current/functions/map.html
>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
>> [4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
>

Re: Adding a Map logical type to the Arrow metadata

Posted by Julian Hyde <jh...@apache.org>.

List<Struct<K, V>> isn’t the only physical representation that makes sense. Because it doesn’t take advantage of the fact that (a) keys can be re-ordered, (b) keys are unique.

So, another viable physical representation would be Struct<List<K>, List<V>>, with the keys sorted. If keys are constant width and in contiguous memory then binary search is very fast.

I am not claiming that this physical representation is better than yours. But the fact that there is a more than one means it’s not a no-brainer.

Julian
 

> On Jul 18, 2017, at 12:10 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> I recently created https://issues.apache.org/jira/browse/ARROW-1207
> and wanted to discuss on the mailing list to hear opinions about how
> to proceed.
> 
> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
> composite type. These are sometimes stored in Parquet as a repeated
> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
> 
> While we can represent in-memory map data as List<Struct<K, V>>, it
> may be useful to add a new logical type to the set of supported
> logical types [3]. The idea is that the memory format between Map<K,
> V> and List<Struct<K, V>> is identical, so this is strictly a logical
> construct, similar to date/time values having the same in-memory
> format as the corresponding integer types (int32/int64)
> 
> For Arrow implementation that do not provide a first class Map
> container, they could process the data as though it were a repeated
> struct. It would be helpful to us in C++ to have an arrow::MapArray
> container because we could convert to / from this type and other data
> structures like Python dictionaries. It would also be helpful to
> faithfully transport the MAP logical type from Parquet [4]
> 
> Let me know what others think. One question I have is whether the
> repeated struct in-memory representation makes sense as the canonical
> map representation.
> 
> Thanks
> Wes
> 
> [1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
> [2]: https://prestodb.io/docs/current/functions/map.html
> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
> [4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63