You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/06/26 12:58:05 UTC

[Format][C++] Offering limited support for unsigned dictionary indices

hi folks,

At the moment, using unsigned integers for dictionary indices/codes
isn't exactly forbidden by the metadata [1], which says that the
indices must be "positive integers". Meanwhile, the columnar format
specification says

"When a field is dictionary encoded, the values are represented by an
array of signed integers representing the index of the value in the
dictionary. The memory layout for a dictionary-encoded array is the
same as that of a primitive signed integer layout."

I was looking at a discussion in RAPIDS about this topic [2]

When we drafted the columnar specification for this, the intention as
I recall was only to support signed integer indices. The rationale
basically is that:

* Better cross platform / language support for signed (e.g. certainly
in the JVM)
* Supporting 4 instead of 8 index types is less burdensome for the
developer and means less code to generate to support them
* Unsigned wraparound bugs could pass silently

I think it would be feasible to support the unsigned indices with the
following caveats:

* Signed integers are recommended as the "compatible" and preferred choice
* Most algorithms in the reference libraries should choose signed over
unsigned when generating indices
* Libraries may choose to promote unsigned to signed (e.g. in Java) if
they don't support unsigned well

I can't say I'm thrilled about having to maintain extra code for the
unsigned case, but it also seems like it would not do great harm
overall. Additionally, if you are certain that the indices are all
non-negative, then you can use the same code to process both intX and
uintX -- we use this trick in the Take implementation in C++ to
generate half as much binary code.

Thoughts?

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
[2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Posted by Wes McKinney <we...@gmail.com>.

I opened a PR

https://github.com/apache/arrow/pull/7567

We should probably vote on this so I will wait another day or so
before opening a vote so we can get this change through for the
release

On Sun, Jun 28, 2020 at 9:38 AM Wes McKinney <we...@gmail.com> wrote:
>
> I'll propose a patch to Columnar.rst about this. I think it's OK to
> allow uint64 indices (so we don't have to revisit the issue in a year
> or two or...) but the specification should say that they are not
> preferred.
>
> On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor <pt...@gmail.com> wrote:
> >
> > Responding to this comment from GitHub[1]:
> >
> > > If we had to make a bet about what % of dictionaries empirically are
> > > between 128 and 255 elements, I would bet that the percentage is
> > > small. If it turned out that 40% of dictionaries fell in that range
> > > then I would agree that this makes sense.
> >
> > I agree, the case where you'd use uint8 vs. int16 isn't very motivating,
> > since those are probably small tables or temporary allocations inside
> > larger operations.
> >
> > The case where you'd use uint16 vs. int32 is more motivating, since
> > dictionaries between 32767 and 65535 elements could easily back larger
> > columns and saving up to 1GiB on the keys can quickly add up.
> >
> > > I would recommend that the specification advise
> > > against use of 64-bit indices at all unless that are actually needed
> > > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > > UINT32_MAX elements
> >
> > Agreed.
> >
> > > but doesn't strike me as being a common occurrence
> >
> > This is somewhat common in certain graph operations on large datasets,
> > but I concede I may not be a representative sample of Arrow users :)
> >
> > Paul
> >
> > 1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352
> >
> > On 6/26/20 1:53 PM, Wes McKinney wrote:
> > > I think that situations where you need uint64 indices are likely to be
> > > exceedingly esoteric. I would recommend that the specification advise
> > > against use of 64-bit indices at all unless that are actually needed
> > > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > > UINT32_MAX elements, but doesn't strike me as being a common
> > > occurrence)
> > >
> > > IIUC, the only change that would be necessary would be a revision to
> > > Columnar.rst and perhaps some language augmentation in Schema.fbs, and
> > > we might want to add some integration tests for unsigned indices to
> > > probe whether each language supports them. The changes needed to
> > > support unsigned integers in C++ are probably not that invasive but I
> > > haven't taken a close look at it yet
> > >
> > > On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <pt...@gmail.com> wrote:
> > >> If positive integers are expected, I'm in favor of supporting unsigned
> > >> index types. I was surprised at Arrow C++ restriction on signed indices
> > >> in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.
> > >>
> > >> Based on the flatbuffer schemas, dictionary indices could technically be
> > >> any Arrow type, which I assumed was to allow for more
> > >> complex/exotic/custom indexing schemes in the future. JS will allow you
> > >> to specify _any_ Arrow type as the dictionary codes, though using a
> > >> non-numeric type without a custom enumerator is UB.
> > >>
> > >> I'm also curious about how the restriction on dictionary index types
> > >> interacts with streaming delta dictionaries. In theory, you could have a
> > >> streaming data source produce enough delta dictionaries such that the
> > >> total dictionary size grows beyond 2^31-1 elements.
> > >>
> > >> I think that's a valid use-case of delta dictionaries assuming Arrow
> > >> aggregates the dictionaries into multiple RecordBatches (or a
> > >> ChunkedArray), which is what JS does. But if that were allowed, we would
> > >> have to allow 64-bit (signed or unsigned) dictionary index types.
> > >>
> > >> Paul
> > >>
> > >>
> > >> On 6/26/20 5:58 AM, Wes McKinney wrote:
> > >>> hi folks,
> > >>>
> > >>> At the moment, using unsigned integers for dictionary indices/codes
> > >>> isn't exactly forbidden by the metadata [1], which says that the
> > >>> indices must be "positive integers". Meanwhile, the columnar format
> > >>> specification says
> > >>>
> > >>> "When a field is dictionary encoded, the values are represented by an
> > >>> array of signed integers representing the index of the value in the
> > >>> dictionary. The memory layout for a dictionary-encoded array is the
> > >>> same as that of a primitive signed integer layout."
> > >>>
> > >>> I was looking at a discussion in RAPIDS about this topic [2]
> > >>>
> > >>> When we drafted the columnar specification for this, the intention as
> > >>> I recall was only to support signed integer indices. The rationale
> > >>> basically is that:
> > >>>
> > >>> * Better cross platform / language support for signed (e.g. certainly
> > >>> in the JVM)
> > >>> * Supporting 4 instead of 8 index types is less burdensome for the
> > >>> developer and means less code to generate to support them
> > >>> * Unsigned wraparound bugs could pass silently
> > >>>
> > >>> I think it would be feasible to support the unsigned indices with the
> > >>> following caveats:
> > >>>
> > >>> * Signed integers are recommended as the "compatible" and preferred choice
> > >>> * Most algorithms in the reference libraries should choose signed over
> > >>> unsigned when generating indices
> > >>> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> > >>> they don't support unsigned well
> > >>>
> > >>> I can't say I'm thrilled about having to maintain extra code for the
> > >>> unsigned case, but it also seems like it would not do great harm
> > >>> overall. Additionally, if you are certain that the indices are all
> > >>> non-negative, then you can use the same code to process both intX and
> > >>> uintX -- we use this trick in the Take implementation in C++ to
> > >>> generate half as much binary code.
> > >>>
> > >>> Thoughts?
> > >>>
> > >>> Thanks
> > >>> Wes
> > >>>
> > >>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> > >>> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Posted by Wes McKinney <we...@gmail.com>.

I'll propose a patch to Columnar.rst about this. I think it's OK to
allow uint64 indices (so we don't have to revisit the issue in a year
or two or...) but the specification should say that they are not
preferred.

On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor <pt...@gmail.com> wrote:
>
> Responding to this comment from GitHub[1]:
>
> > If we had to make a bet about what % of dictionaries empirically are
> > between 128 and 255 elements, I would bet that the percentage is
> > small. If it turned out that 40% of dictionaries fell in that range
> > then I would agree that this makes sense.
>
> I agree, the case where you'd use uint8 vs. int16 isn't very motivating,
> since those are probably small tables or temporary allocations inside
> larger operations.
>
> The case where you'd use uint16 vs. int32 is more motivating, since
> dictionaries between 32767 and 65535 elements could easily back larger
> columns and saving up to 1GiB on the keys can quickly add up.
>
> > I would recommend that the specification advise
> > against use of 64-bit indices at all unless that are actually needed
> > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > UINT32_MAX elements
>
> Agreed.
>
> > but doesn't strike me as being a common occurrence
>
> This is somewhat common in certain graph operations on large datasets,
> but I concede I may not be a representative sample of Arrow users :)
>
> Paul
>
> 1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352
>
> On 6/26/20 1:53 PM, Wes McKinney wrote:
> > I think that situations where you need uint64 indices are likely to be
> > exceedingly esoteric. I would recommend that the specification advise
> > against use of 64-bit indices at all unless that are actually needed
> > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > UINT32_MAX elements, but doesn't strike me as being a common
> > occurrence)
> >
> > IIUC, the only change that would be necessary would be a revision to
> > Columnar.rst and perhaps some language augmentation in Schema.fbs, and
> > we might want to add some integration tests for unsigned indices to
> > probe whether each language supports them. The changes needed to
> > support unsigned integers in C++ are probably not that invasive but I
> > haven't taken a close look at it yet
> >
> > On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <pt...@gmail.com> wrote:
> >> If positive integers are expected, I'm in favor of supporting unsigned
> >> index types. I was surprised at Arrow C++ restriction on signed indices
> >> in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.
> >>
> >> Based on the flatbuffer schemas, dictionary indices could technically be
> >> any Arrow type, which I assumed was to allow for more
> >> complex/exotic/custom indexing schemes in the future. JS will allow you
> >> to specify _any_ Arrow type as the dictionary codes, though using a
> >> non-numeric type without a custom enumerator is UB.
> >>
> >> I'm also curious about how the restriction on dictionary index types
> >> interacts with streaming delta dictionaries. In theory, you could have a
> >> streaming data source produce enough delta dictionaries such that the
> >> total dictionary size grows beyond 2^31-1 elements.
> >>
> >> I think that's a valid use-case of delta dictionaries assuming Arrow
> >> aggregates the dictionaries into multiple RecordBatches (or a
> >> ChunkedArray), which is what JS does. But if that were allowed, we would
> >> have to allow 64-bit (signed or unsigned) dictionary index types.
> >>
> >> Paul
> >>
> >>
> >> On 6/26/20 5:58 AM, Wes McKinney wrote:
> >>> hi folks,
> >>>
> >>> At the moment, using unsigned integers for dictionary indices/codes
> >>> isn't exactly forbidden by the metadata [1], which says that the
> >>> indices must be "positive integers". Meanwhile, the columnar format
> >>> specification says
> >>>
> >>> "When a field is dictionary encoded, the values are represented by an
> >>> array of signed integers representing the index of the value in the
> >>> dictionary. The memory layout for a dictionary-encoded array is the
> >>> same as that of a primitive signed integer layout."
> >>>
> >>> I was looking at a discussion in RAPIDS about this topic [2]
> >>>
> >>> When we drafted the columnar specification for this, the intention as
> >>> I recall was only to support signed integer indices. The rationale
> >>> basically is that:
> >>>
> >>> * Better cross platform / language support for signed (e.g. certainly
> >>> in the JVM)
> >>> * Supporting 4 instead of 8 index types is less burdensome for the
> >>> developer and means less code to generate to support them
> >>> * Unsigned wraparound bugs could pass silently
> >>>
> >>> I think it would be feasible to support the unsigned indices with the
> >>> following caveats:
> >>>
> >>> * Signed integers are recommended as the "compatible" and preferred choice
> >>> * Most algorithms in the reference libraries should choose signed over
> >>> unsigned when generating indices
> >>> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> >>> they don't support unsigned well
> >>>
> >>> I can't say I'm thrilled about having to maintain extra code for the
> >>> unsigned case, but it also seems like it would not do great harm
> >>> overall. Additionally, if you are certain that the indices are all
> >>> non-negative, then you can use the same code to process both intX and
> >>> uintX -- we use this trick in the Take implementation in C++ to
> >>> generate half as much binary code.
> >>>
> >>> Thoughts?
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> >>> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Posted by Paul Taylor <pt...@gmail.com>.

Responding to this comment from GitHub[1]:

> If we had to make a bet about what % of dictionaries empirically are 
> between 128 and 255 elements, I would bet that the percentage is 
> small. If it turned out that 40% of dictionaries fell in that range 
> then I would agree that this makes sense.

I agree, the case where you'd use uint8 vs. int16 isn't very motivating, 
since those are probably small tables or temporary allocations inside 
larger operations.

The case where you'd use uint16 vs. int32 is more motivating, since 
dictionaries between 32767 and 65535 elements could easily back larger 
columns and saving up to 1GiB on the keys can quickly add up.

> I would recommend that the specification advise
> against use of 64-bit indices at all unless that are actually needed
> to represent the data (i.e. dictionaries have more than INT32_MAX /
> UINT32_MAX elements

Agreed.

> but doesn't strike me as being a common occurrence

This is somewhat common in certain graph operations on large datasets, 
but I concede I may not be a representative sample of Arrow users :)

Paul

1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352

On 6/26/20 1:53 PM, Wes McKinney wrote:
> I think that situations where you need uint64 indices are likely to be
> exceedingly esoteric. I would recommend that the specification advise
> against use of 64-bit indices at all unless that are actually needed
> to represent the data (i.e. dictionaries have more than INT32_MAX /
> UINT32_MAX elements, but doesn't strike me as being a common
> occurrence)
>
> IIUC, the only change that would be necessary would be a revision to
> Columnar.rst and perhaps some language augmentation in Schema.fbs, and
> we might want to add some integration tests for unsigned indices to
> probe whether each language supports them. The changes needed to
> support unsigned integers in C++ are probably not that invasive but I
> haven't taken a close look at it yet
>
> On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <pt...@gmail.com> wrote:
>> If positive integers are expected, I'm in favor of supporting unsigned
>> index types. I was surprised at Arrow C++ restriction on signed indices
>> in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.
>>
>> Based on the flatbuffer schemas, dictionary indices could technically be
>> any Arrow type, which I assumed was to allow for more
>> complex/exotic/custom indexing schemes in the future. JS will allow you
>> to specify _any_ Arrow type as the dictionary codes, though using a
>> non-numeric type without a custom enumerator is UB.
>>
>> I'm also curious about how the restriction on dictionary index types
>> interacts with streaming delta dictionaries. In theory, you could have a
>> streaming data source produce enough delta dictionaries such that the
>> total dictionary size grows beyond 2^31-1 elements.
>>
>> I think that's a valid use-case of delta dictionaries assuming Arrow
>> aggregates the dictionaries into multiple RecordBatches (or a
>> ChunkedArray), which is what JS does. But if that were allowed, we would
>> have to allow 64-bit (signed or unsigned) dictionary index types.
>>
>> Paul
>>
>>
>> On 6/26/20 5:58 AM, Wes McKinney wrote:
>>> hi folks,
>>>
>>> At the moment, using unsigned integers for dictionary indices/codes
>>> isn't exactly forbidden by the metadata [1], which says that the
>>> indices must be "positive integers". Meanwhile, the columnar format
>>> specification says
>>>
>>> "When a field is dictionary encoded, the values are represented by an
>>> array of signed integers representing the index of the value in the
>>> dictionary. The memory layout for a dictionary-encoded array is the
>>> same as that of a primitive signed integer layout."
>>>
>>> I was looking at a discussion in RAPIDS about this topic [2]
>>>
>>> When we drafted the columnar specification for this, the intention as
>>> I recall was only to support signed integer indices. The rationale
>>> basically is that:
>>>
>>> * Better cross platform / language support for signed (e.g. certainly
>>> in the JVM)
>>> * Supporting 4 instead of 8 index types is less burdensome for the
>>> developer and means less code to generate to support them
>>> * Unsigned wraparound bugs could pass silently
>>>
>>> I think it would be feasible to support the unsigned indices with the
>>> following caveats:
>>>
>>> * Signed integers are recommended as the "compatible" and preferred choice
>>> * Most algorithms in the reference libraries should choose signed over
>>> unsigned when generating indices
>>> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
>>> they don't support unsigned well
>>>
>>> I can't say I'm thrilled about having to maintain extra code for the
>>> unsigned case, but it also seems like it would not do great harm
>>> overall. Additionally, if you are certain that the indices are all
>>> non-negative, then you can use the same code to process both intX and
>>> uintX -- we use this trick in the Take implementation in C++ to
>>> generate half as much binary code.
>>>
>>> Thoughts?
>>>
>>> Thanks
>>> Wes
>>>
>>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
>>> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Posted by Wes McKinney <we...@gmail.com>.

I think that situations where you need uint64 indices are likely to be
exceedingly esoteric. I would recommend that the specification advise
against use of 64-bit indices at all unless that are actually needed
to represent the data (i.e. dictionaries have more than INT32_MAX /
UINT32_MAX elements, but doesn't strike me as being a common
occurrence)

IIUC, the only change that would be necessary would be a revision to
Columnar.rst and perhaps some language augmentation in Schema.fbs, and
we might want to add some integration tests for unsigned indices to
probe whether each language supports them. The changes needed to
support unsigned integers in C++ are probably not that invasive but I
haven't taken a close look at it yet

On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <pt...@gmail.com> wrote:
>
> If positive integers are expected, I'm in favor of supporting unsigned
> index types. I was surprised at Arrow C++ restriction on signed indices
> in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.
>
> Based on the flatbuffer schemas, dictionary indices could technically be
> any Arrow type, which I assumed was to allow for more
> complex/exotic/custom indexing schemes in the future. JS will allow you
> to specify _any_ Arrow type as the dictionary codes, though using a
> non-numeric type without a custom enumerator is UB.
>
> I'm also curious about how the restriction on dictionary index types
> interacts with streaming delta dictionaries. In theory, you could have a
> streaming data source produce enough delta dictionaries such that the
> total dictionary size grows beyond 2^31-1 elements.
>
> I think that's a valid use-case of delta dictionaries assuming Arrow
> aggregates the dictionaries into multiple RecordBatches (or a
> ChunkedArray), which is what JS does. But if that were allowed, we would
> have to allow 64-bit (signed or unsigned) dictionary index types.
>
> Paul
>
>
> On 6/26/20 5:58 AM, Wes McKinney wrote:
> > hi folks,
> >
> > At the moment, using unsigned integers for dictionary indices/codes
> > isn't exactly forbidden by the metadata [1], which says that the
> > indices must be "positive integers". Meanwhile, the columnar format
> > specification says
> >
> > "When a field is dictionary encoded, the values are represented by an
> > array of signed integers representing the index of the value in the
> > dictionary. The memory layout for a dictionary-encoded array is the
> > same as that of a primitive signed integer layout."
> >
> > I was looking at a discussion in RAPIDS about this topic [2]
> >
> > When we drafted the columnar specification for this, the intention as
> > I recall was only to support signed integer indices. The rationale
> > basically is that:
> >
> > * Better cross platform / language support for signed (e.g. certainly
> > in the JVM)
> > * Supporting 4 instead of 8 index types is less burdensome for the
> > developer and means less code to generate to support them
> > * Unsigned wraparound bugs could pass silently
> >
> > I think it would be feasible to support the unsigned indices with the
> > following caveats:
> >
> > * Signed integers are recommended as the "compatible" and preferred choice
> > * Most algorithms in the reference libraries should choose signed over
> > unsigned when generating indices
> > * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> > they don't support unsigned well
> >
> > I can't say I'm thrilled about having to maintain extra code for the
> > unsigned case, but it also seems like it would not do great harm
> > overall. Additionally, if you are certain that the indices are all
> > non-negative, then you can use the same code to process both intX and
> > uintX -- we use this trick in the Take implementation in C++ to
> > generate half as much binary code.
> >
> > Thoughts?
> >
> > Thanks
> > Wes
> >
> > [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> > [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Posted by Paul Taylor <pt...@gmail.com>.

If positive integers are expected, I'm in favor of supporting unsigned 
index types. I was surprised at Arrow C++ restriction on signed indices 
in the RAPIDS thread, perhaps it's newer than when I ported the logic in JS.

Based on the flatbuffer schemas, dictionary indices could technically be 
any Arrow type, which I assumed was to allow for more 
complex/exotic/custom indexing schemes in the future. JS will allow you 
to specify _any_ Arrow type as the dictionary codes, though using a 
non-numeric type without a custom enumerator is UB.

I'm also curious about how the restriction on dictionary index types 
interacts with streaming delta dictionaries. In theory, you could have a 
streaming data source produce enough delta dictionaries such that the 
total dictionary size grows beyond 2^31-1 elements.

I think that's a valid use-case of delta dictionaries assuming Arrow 
aggregates the dictionaries into multiple RecordBatches (or a 
ChunkedArray), which is what JS does. But if that were allowed, we would 
have to allow 64-bit (signed or unsigned) dictionary index types.

Paul

On 6/26/20 5:58 AM, Wes McKinney wrote:
> hi folks,
>
> At the moment, using unsigned integers for dictionary indices/codes
> isn't exactly forbidden by the metadata [1], which says that the
> indices must be "positive integers". Meanwhile, the columnar format
> specification says
>
> "When a field is dictionary encoded, the values are represented by an
> array of signed integers representing the index of the value in the
> dictionary. The memory layout for a dictionary-encoded array is the
> same as that of a primitive signed integer layout."
>
> I was looking at a discussion in RAPIDS about this topic [2]
>
> When we drafted the columnar specification for this, the intention as
> I recall was only to support signed integer indices. The rationale
> basically is that:
>
> * Better cross platform / language support for signed (e.g. certainly
> in the JVM)
> * Supporting 4 instead of 8 index types is less burdensome for the
> developer and means less code to generate to support them
> * Unsigned wraparound bugs could pass silently
>
> I think it would be feasible to support the unsigned indices with the
> following caveats:
>
> * Signed integers are recommended as the "compatible" and preferred choice
> * Most algorithms in the reference libraries should choose signed over
> unsigned when generating indices
> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> they don't support unsigned well
>
> I can't say I'm thrilled about having to maintain extra code for the
> unsigned case, but it also seems like it would not do great harm
> overall. Additionally, if you are certain that the indices are all
> non-negative, then you can use the same code to process both intX and
> uintX -- we use this trick in the Take implementation in C++ to
> generate half as much binary code.
>
> Thoughts?
>
> Thanks
> Wes
>
> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Posted by Micah Kornfield <em...@gmail.com>.

I think in the interest of not having the spec fork we should probably do
this.  It is partially our fault for not providing better documentation in
Schema.fbs (and potentially more thorough integration tests).

Maybe we should explicitly disallow uint64 which provides the biggest
headache for the JVM and it is probably used rarely if at all for
dictionaries.  It is possible even uint32 might be unnecessary?

-Micah

On Fri, Jun 26, 2020 at 5:58 AM Wes McKinney <we...@gmail.com> wrote:

> hi folks,
>
> At the moment, using unsigned integers for dictionary indices/codes
> isn't exactly forbidden by the metadata [1], which says that the
> indices must be "positive integers". Meanwhile, the columnar format
> specification says
>
> "When a field is dictionary encoded, the values are represented by an
> array of signed integers representing the index of the value in the
> dictionary. The memory layout for a dictionary-encoded array is the
> same as that of a primitive signed integer layout."
>
> I was looking at a discussion in RAPIDS about this topic [2]
>
> When we drafted the columnar specification for this, the intention as
> I recall was only to support signed integer indices. The rationale
> basically is that:
>
> * Better cross platform / language support for signed (e.g. certainly
> in the JVM)
> * Supporting 4 instead of 8 index types is less burdensome for the
> developer and means less code to generate to support them
> * Unsigned wraparound bugs could pass silently
>
> I think it would be feasible to support the unsigned indices with the
> following caveats:
>
> * Signed integers are recommended as the "compatible" and preferred choice
> * Most algorithms in the reference libraries should choose signed over
> unsigned when generating indices
> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> they don't support unsigned well
>
> I can't say I'm thrilled about having to maintain extra code for the
> unsigned case, but it also seems like it would not do great harm
> overall. Additionally, if you are certain that the indices are all
> non-negative, then you can use the same code to process both intX and
> uintX -- we use this trick in the Take implementation in C++ to
> generate half as much binary code.
>
> Thoughts?
>
> Thanks
> Wes
>
> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509
>