Posted to dev@arrow.apache.org by Pierre Belzile <pi...@gmail.com> on 2020/02/28 22:56:58 UTC

Crash with 0.15.1 when transposing dicts with null values

When I read an array of type dictionary (int32 -> string) from a Parquet
file and that array has null positions, the indices at the null positions
appear to be undefined, i.e. not guaranteed to be 0. This causes a crash
when a transpose map is used to look up the transposed value. Does this
seem possible? Is it fixed in 0.16.0?

If not, I can create a JIRA, but it is difficult to write a code snippet
that reproduces it because it depends on uninitialized memory.

Pierre

Re: Crash with 0.15.1 when transposing dicts with null values

Posted by Antoine Pitrou <an...@python.org>.
Hi Pierre,

While the Arrow format doesn't mandate particular values under null
slots, the Arrow C++ implementation should not create "undefined" values
(for security reasons: failing to initialize data could reveal
confidential information that was previously at the same memory location).

In practice, it will generally write zeros in the value slots
corresponding to null values.  You can see it for example here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_primitive.h#L86
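
As a rough illustration of that convention, here is a standalone toy model
of a builder that fills the value slot under each null with a defined zero.
It is a sketch only, not Arrow's actual builder: the class ToyInt32Builder
and all of its members are hypothetical names, and it uses a plain
std::vector<bool> where Arrow uses a packed validity bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a primitive array builder (hypothetical, not Arrow's API).
class ToyInt32Builder {
 public:
  void Append(int32_t value) {
    values_.push_back(value);
    valid_.push_back(true);
  }

  // A null still occupies a value slot; write a defined zero into it
  // instead of leaving whatever happened to be in memory.
  void AppendNull() {
    values_.push_back(0);
    valid_.push_back(false);
  }

  bool IsNull(std::size_t i) const {
    assert(i < valid_.size());
    return !valid_[i];
  }

  const std::vector<int32_t>& values() const { return values_; }

 private:
  std::vector<int32_t> values_;
  std::vector<bool> valid_;
};
```

With this model, appending 7, a null, and 9 leaves the value buffer as
{7, 0, 9}, so a consumer that ignores validity still reads a defined value.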

If you have found a situation where that is not the case (with the
latest version or with git master), you should open a JIRA issue.

Regards

Antoine.


On 29/02/2020 at 23:48, Pierre Belzile wrote:
> Hi Wes,
> 
> I guess the answer is that it is not fixed...
> 
> At this line,
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L409,
> there is a utility for transposing a dictionary array from its original
> indices to a new mapping, which is used when unifying dictionaries. I use
> it when concatenating tables. When the index data underneath a null is
> "unset", transpose_map[src[0]] can easily read out of bounds, especially
> with int32 indices.
> 
> That function cannot guard against this with its current signature,
> because it is not given the length of the transpose map. It would be easy
> to fix by passing an upper bound.
> 
> But I think the problem is more general. If one cares about the
> corresponding string value, it is impossible to process the indices array
> blindly, unlike all the other array types. I've worked around this in my
> code by resetting the indices at all null positions to 0 before unifying a
> dictionary. But the same issue arises in comparisons (for example, when
> sorting on the string value).
> 
> I'm not sure whether I got into this situation through some operation of
> mine that left the null positions uninitialized, or whether Parquet
> deserialization can lead to it. Hence my original question.
> 
> Pierre

Re: Crash with 0.15.1 when transposing dicts with null values

Posted by Pierre Belzile <pi...@gmail.com>.
Hi Wes,

I guess the answer is that it is not fixed...

At this line,
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L409,
there is a utility for transposing a dictionary array from its original
indices to a new mapping, which is used when unifying dictionaries. I use
it when concatenating tables. When the index data underneath a null is
"unset", transpose_map[src[0]] can easily read out of bounds, especially
with int32 indices.

That function cannot guard against this with its current signature,
because it is not given the length of the transpose map. It would be easy
to fix by passing an upper bound.
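
A minimal sketch of what passing an upper bound could look like, written as
standalone C++ rather than against the Arrow sources. The function name
TransposeIndicesChecked and its signature are hypothetical; the only idea
taken from the discussion is that the length of the transpose map is known
to the function, so an index outside the map can be rejected instead of
being dereferenced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Arrow: remaps each dictionary index in
// src through transpose_map, rejecting any index that falls outside
// [0, transpose_map.size()). Returns false at the first bad index; outputs
// from that position onward stay zero.
bool TransposeIndicesChecked(const std::vector<int32_t>& src,
                             const std::vector<int32_t>& transpose_map,
                             std::vector<int32_t>* dest) {
  assert(dest != nullptr);
  dest->assign(src.size(), 0);
  const int64_t upper_bound = static_cast<int64_t>(transpose_map.size());
  for (std::size_t i = 0; i < src.size(); ++i) {
    const int64_t index = src[i];
    if (index < 0 || index >= upper_bound) {
      return false;  // would have read past the end of transpose_map
    }
    (*dest)[i] = transpose_map[static_cast<std::size_t>(index)];
  }
  return true;
}
```

The check costs one comparison per element. A variant that also takes the
validity bitmap could skip null slots instead of failing on them.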

But I think the problem is more general. If one cares about the
corresponding string value, it is impossible to process the indices array
blindly, unlike all the other array types. I've worked around this in my
code by resetting the indices at all null positions to 0 before unifying a
dictionary. But the same issue arises in comparisons (for example, when
sorting on the string value).
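
The workaround of resetting null positions can be sketched roughly as
follows, again as standalone C++ and not the code actually used. It assumes
the indices are int32 and that validity is stored the way the Arrow format
lays it out: a bitmap with least-significant-bit numbering, where a set bit
means the slot is valid. The helper names IsValid and ZeroIndicesUnderNulls
are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Slot i is valid when bit (i % 8) of byte (i / 8) is set (LSB numbering).
inline bool IsValid(const std::vector<uint8_t>& validity, std::size_t i) {
  return ((validity[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical helper: overwrite the index under every null slot with 0 so
// that later lookups through a transpose map stay in bounds.
void ZeroIndicesUnderNulls(const std::vector<uint8_t>& validity,
                           std::vector<int32_t>* indices) {
  assert(indices != nullptr);
  // The bitmap must cover every slot of the indices array.
  assert(validity.size() * 8 >= indices->size());
  for (std::size_t i = 0; i < indices->size(); ++i) {
    if (!IsValid(validity, i)) {
      (*indices)[i] = 0;
    }
  }
}
```

Note that 0 is only a safe placeholder when the dictionary has at least one
entry; with an empty dictionary, even index 0 is out of bounds.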

I'm not sure whether I got into this situation through some operation of
mine that left the null positions uninitialized, or whether Parquet
deserialization can lead to it. Hence my original question.

Pierre


On Sat, Feb 29, 2020 at 15:52, Wes McKinney <we...@gmail.com> wrote:

> The Arrow format does not indicate any particular value "underneath" a
> null so I'm not sure what can be "fixed" here. What precisely are you
> doing with the data that is failing?

Re: Crash with 0.15.1 when transposing dicts with null values

Posted by Wes McKinney <we...@gmail.com>.
The Arrow format does not indicate any particular value "underneath" a
null so I'm not sure what can be "fixed" here. What precisely are you
doing with the data that is failing?

On Fri, Feb 28, 2020 at 4:57 PM Pierre Belzile <pi...@gmail.com> wrote:
>
> When I recover an array of type dictionary int32 -> string from a parquet
> file and that array has null positions, it seems that the indices that
> correspond to null positions are undefined. I.e. not guaranteed to be 0.
> This causes a crash when using a transpose map when trying to read the
> transpose value. Does this seem possible? Fixed in 0.16.0?
>
> If not I can create a JIRA but it is difficult to create a code snippet to
> reproduce because it depends on uninitialized memory.
>
> Pierre