Posted to dev@arrow.apache.org by Brian Hulette <br...@ccri.com> on 2018/04/06 14:42:35 UTC

Allow dictionary-encoded children?

I've been considering a use-case with a dictionary-encoded struct 
column, which may contain some dictionary-encoded columns itself. More 
specifically, in this use-case each row represents a single observation 
in a geospatial track, which includes a position, a time, and some 
track-level metadata (track id, origin, destination, etc.). I would 
like to represent the metadata as a dictionary-encoded struct, since 
unique values will be repeated for each observation of that track, and I 
would _also_ like to dictionary-encode some of the metadata column's 
children, since unique values will typically be repeated in multiple tracks.

I think one could make a (totally legitimate) argument that this is 
stretching a format designed for tabular data too far. This use-case 
could also be accomplished by breaking out the struct metadata column 
into its own arrow table, and managing a new integer column that 
references that table. This would look almost identical to what I 
initially described, it just wouldn't rely on the arrow libraries to 
manage the "dictionary".
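
To make the layout described above concrete, here is a minimal sketch in plain Python (no Arrow calls). The names `DictEncoded`, `dict_encode`, `observation_track`, and `metadata_for`, and the sample values, are hypothetical and only for illustration. Each observation stores an integer index into per-track metadata, and the metadata's origin and destination fields are themselves stored as indices into their own dictionaries. Read the other way, the same structure is the "separate table plus integer reference column" alternative.

```python
from dataclasses import dataclass


@dataclass
class DictEncoded:
    """A column stored as integer indices into a list of unique values."""
    indices: list
    dictionary: list

    def decode(self):
        return [self.dictionary[i] for i in self.indices]


def dict_encode(values):
    """Dictionary-encode hashable values, keeping first-seen order."""
    positions = {}
    indices = []
    for v in values:
        # setdefault assigns the next free index the first time v is seen
        indices.append(positions.setdefault(v, len(positions)))
    return DictEncoded(indices, list(positions))


# Per-track metadata (one entry per track). The origin and destination
# values repeat across tracks, so they are dictionary-encoded themselves.
track_ids = ["t1", "t2", "t3"]
origins = dict_encode(["BOS", "JFK", "BOS"])
destinations = dict_encode(["JFK", "BOS", "JFK"])

# Per-observation column: an index into the per-track metadata above.
# This is the outer "dictionary-encoded struct" (or, equivalently, the
# integer column referencing a separate metadata table).
observation_track = [0, 0, 0, 1, 1, 2]


def metadata_for(row):
    """Resolve both levels of indirection for one observation row."""
    t = observation_track[row]
    return {
        "track_id": track_ids[t],
        "origin": origins.dictionary[origins.indices[t]],
        "destination": destinations.dictionary[destinations.indices[t]],
    }
```

Looking up a row touches two levels of indices: first the row's track index, then that track's origin and destination indices.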


The spec doesn't have anything to say on this topic as far as I can 
tell, but our implementations don't currently allow a dictionary-encoded 
column's children to be dictionary-encoded themselves [1]. Is this just 
a simplifying assumption, or a hard rule that should be codified in the 
spec?

Thanks,
Brian

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824

Re: Allow dictionary-encoded children?

Posted by Brian Hulette <br...@ccri.com>.

Thanks Uwe, Wes, glad to hear I'm not too far out there :) The 
dictionary batch ordering seems like a reasonable requirement for this 
situation.

I made a JIRA to add something like this to the integration tests 
(https://issues.apache.org/jira/browse/ARROW-2412) and I'll put up a PR 
shortly.

On 04/06/2018 01:43 PM, Wes McKinney wrote:
> Having dictionaries-within-dictionaries does add some complexity, but
> I think the use case is valid and so it would be good to determine the
> best way to handle this in the IPC / messaging protocol.
>
> I would suggest: dictionaries can use other dictionaries, so long as
> those dictionaries occur earlier in the stream. I am not sure either
> the Java or C++ libraries will be able to properly handle these cases
> right now, but that's what we have integration tests for!

Re: Allow dictionary-encoded children?

Posted by Wes McKinney <we...@gmail.com>.

Having dictionaries-within-dictionaries does add some complexity, but
I think the use case is valid and so it would be good to determine the
best way to handle this in the IPC / messaging protocol.

I would suggest: dictionaries can use other dictionaries, so long as
those dictionaries occur earlier in the stream. I am not sure either
the Java or C++ libraries will be able to properly handle these cases
right now, but that's what we have integration tests for!
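
As a rough illustration of the ordering rule suggested above, the following sketch uses only the Python standard library (`graphlib`, Python 3.9+) and is not based on any Arrow API. The dictionary ids and the `depends_on` mapping are hypothetical. It computes an order in which every dictionary is emitted after the dictionaries it uses, and checks whether a given stream order satisfies that rule.

```python
from graphlib import TopologicalSorter


def emit_order(depends_on):
    """Order dictionary ids so each one follows the dictionaries it uses.

    depends_on maps a dictionary id to the set of ids it refers to.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def is_valid_stream(order, depends_on):
    """Check that every dictionary appears after all of its dependencies."""
    seen = set()
    for dict_id in order:
        if any(dep not in seen for dep in depends_on.get(dict_id, ())):
            return False
        seen.add(dict_id)
    return True


# Hypothetical ids: 0 is the struct metadata dictionary, whose children
# use dictionary 1 (origin) and dictionary 2 (destination).
depends_on = {0: {1, 2}, 1: set(), 2: set()}
order = emit_order(depends_on)
```

Here the struct dictionary (id 0) always comes last in `order`, while a stream that sends it first, such as `[0, 1, 2]`, fails `is_valid_stream`.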

On Fri, Apr 6, 2018 at 11:59 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Brian,
>
> I would also have considered this a legitimate use of the Arrow specification. We only specify the DictionaryType to have a dictionary of any Arrow Type. In the context of Arrow's IPC this seems to be a bit more complicated as we seem to have the assumption that there is only one type of Dictionary per column. I would argue that we should be able to support this once we work out a reliable way to transfer them via the IPC mechanism.
>
> Just as a related thought (might not produce the result you want): In Parquet, only the values on the lowest level are dictionary-encoded. But this is also due to the fact that Parquet uses repetition and definition levels to encode arbitrarily nested data types. These are more space-efficient when they are correctly encoded but don't provide random access.
>
> Uwe

Re: Allow dictionary-encoded children?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hello Brian,

I would also have considered this a legitimate use of the Arrow specification. We only specify the DictionaryType to have a dictionary of any Arrow Type. In the context of Arrow's IPC this seems to be a bit more complicated as we seem to have the assumption that there is only one type of Dictionary per column. I would argue that we should be able to support this once we work out a reliable way to transfer them via the IPC mechanism.

Just as a related thought (might not produce the result you want): In Parquet, only the values on the lowest level are dictionary-encoded. But this is also due to the fact that Parquet uses repetition and definition levels to encode arbitrarily nested data types. These are more space-efficient when they are correctly encoded but don't provide random access.

Uwe

On Fri, Apr 6, 2018, at 4:42 PM, Brian Hulette wrote:
> I've been considering a use-case with a dictionary-encoded struct 
> column, which may contain some dictionary-encoded columns itself. More 
> specifically, in this use-case each row represents a single observation 
> in a geospatial track, which includes a position, a time, and some 
> track-level metadata (track id, origin, destination, etc...). I would 
> like to represent the metadata as a dictionary-encoded struct, since 
> unique values will be repeated for each observation of that track, and I 
> would _also_ like to dictionary-encode some of the metadata column's 
> children, since unique values will typically be repeated in multiple tracks.
> 
> I think one could make a (totally legitimate) argument that this is 
> stretching a format designed for tabular data too far. This use-case 
> could also be accomplished by breaking out the struct metadata column 
> into its own arrow table, and managing a new integer column that 
> references that table. This would look almost identical to what I 
> initially described, it just wouldn't rely on the arrow libraries to 
> manage the "dictionary".
> 
> 
> The spec doesn't have anything to say on this topic as far as I can 
> tell, but our implementations don't currently allow a dictionary-encoded 
> column's children to be dictionary-encoded themselves [1]. Is this just 
> a simplifying assumption, or a hard rule that should be codified in the 
> spec?
> 
> Thanks,
> Brian
> 
> [1] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824