You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Adam Szmigin (Jira)" <ji...@apache.org> on 2020/07/08 00:09:00 UTC

[jira] [Created] (ARROW-9366) [C#] BinaryArray.Builder Reserve/Resize methods are broken

Adam Szmigin created ARROW-9366:
-----------------------------------

             Summary: [C#] BinaryArray.Builder Reserve/Resize methods are broken
                 Key: ARROW-9366
                 URL: https://issues.apache.org/jira/browse/ARROW-9366
             Project: Apache Arrow
          Issue Type: Bug
          Components: C#
    Affects Versions: 0.17.1
            Reporter: Adam Szmigin


h1. Summary

The {{Reserve()}} and {{Resize()}} methods on {{BinaryArray.Builder}} are broken, insofar as the current implementation does not perform the intended function and may corrupt the contents of any array built so far:

{code:csharp}
            public TBuilder Reserve(int capacity)
            {
                ValueOffsets.Reserve(capacity + 1);
                ValueBuffer.Reserve(capacity);
                ValidityBuffer.Reserve(capacity + 1);
                return Instance;
            }

            public TBuilder Resize(int length)
            {
                ValueOffsets.Resize(length + 1);
                ValueBuffer.Resize(length);
                ValidityBuffer.Resize(length + 1);
                return Instance;
            }
{code}

In the case of {{Reserve()}}, the {{capacity}} parameter is expected to refer to the desired number of items for which the array builder must have capacity.  However, it is almost certainly meaningless to call this for {{ValueBuffer}}, as this holds variable-length items and each item is very likely to be longer than one byte.

In the case of {{Resize()}}, the current implementation is _dangerous_, because:

# Resizing down will very likely truncate a variable-length value item somewhere in the middle.
# Resizing up will very likely truncate a variable-length value item somewhere too, as most value items are expected to be multiple bytes long, and hence an index equal to the parameter {{length}} is likely to be "back" in some earlier value instead.

h1. Suggested Fix

There are two broad approaches:

# Make existing functions safe:
#* {{Reserve()}} _may_ assume an average length of each value in order to make a sensible memory reservation in the {{ValueBuffer}}.
#* Resizing down _must not_ truncate values in the middle but find the appropriate offset for the new smaller number of items and truncate there instead.
#* Resizing up _must_ add valid value offsets, either with default empty or null values added to pad the array to the desired size.
# Remove the functions from the interface:
#* One could decide that {{Reserve()}} and {{Resize()}} only make sense for fixed-size values and change interfaces accordingly.

I have a mild personal preference for the former in the short term, mainly because it allows the existing interface to behave in a sensible way without entailing a larger-scale refactoring.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)