You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Adam Szmigin (Jira)" <ji...@apache.org> on 2020/07/08 00:09:00 UTC
[jira] [Created] (ARROW-9366) [C#] BinaryArray.Builder
Reserve/Resize methods are broken
Adam Szmigin created ARROW-9366:
-----------------------------------
Summary: [C#] BinaryArray.Builder Reserve/Resize methods are broken
Key: ARROW-9366
URL: https://issues.apache.org/jira/browse/ARROW-9366
Project: Apache Arrow
Issue Type: Bug
Components: C#
Affects Versions: 0.17.1
Reporter: Adam Szmigin
h1. Summary
The {{Reserve()}} and {{Resize()}} methods on {{BinaryArray.Builder}} are broken, insofar as the current implementation does not perform the intended function and may corrupt the contents of any array built so far:
{code:csharp}
public TBuilder Reserve(int capacity)
{
ValueOffsets.Reserve(capacity + 1);
ValueBuffer.Reserve(capacity);
ValidityBuffer.Reserve(capacity + 1);
return Instance;
}
public TBuilder Resize(int length)
{
ValueOffsets.Resize(length + 1);
ValueBuffer.Resize(length);
ValidityBuffer.Resize(length + 1);
return Instance;
}
{code}
In the case of {{Reserve()}}, the {{capacity}} parameter is expected to refer to the desired number of items for which the array builder must have capacity. However, it is almost certainly meaningless to call this for {{ValueBuffer}}, as this holds variable-length items and each item is very likely to be longer than one byte.
In the case of {{Resize()}}, the current implementation is _dangerous_, because:
# Resizing down will very likely truncate a variable-length value item somewhere in the middle.
# Resizing up will very likely truncate a variable-length value item somewhere too, as most value items are expected to be multiple bytes long, and hence an index equal to the parameter {{length}} is likely to be "back" in some earlier value instead.
h1. Suggested Fix
There are two broad approaches:
# Make existing functions safe:
#* {{Reserve()}} _may_ assume an average length of each value in order to make a sensible memory reservation in the {{ValueBuffer}}.
#* Resizing down _must not_ truncate values in the middle but find the appropriate offset for the new smaller number of items and truncate there instead.
#* Resizing up _must_ add valid value offsets, either with default empty or null values added to pad the array to the desired size.
# Remove the functions from the interface:
#* One could decide that {{Reserve()}} and {{Resize()}} only make sense for fixed-size values and change interfaces accordingly.
I have a mild personal preference for the former in the short term, mainly because it allows the existing interface to behave in a sensible way without entailing a larger-scale refactoring.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)