Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/18 15:36:00 UTC

[jira] [Commented] (ARROW-2462) [C++] Segfault when writing a parquet table containing a dictionary column from Record Batch Stream

    [ https://issues.apache.org/jira/browse/ARROW-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442703#comment-16442703 ] 

ASF GitHub Bot commented on ARROW-2462:
---------------------------------------

xhochy commented on issue #1896: ARROW-2462: [C++] Fix Segfault in UnpackBinaryDictionary
URL: https://github.com/apache/arrow/pull/1896#issuecomment-382430337
 
 
   Change looks good but has formatting issues. @zeroshade can you run `make format` and commit the changes?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [C++] Segfault when writing a parquet table containing a dictionary column from Record Batch Stream
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2462
>                 URL: https://issues.apache.org/jira/browse/ARROW-2462
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.9.1
>            Reporter: Matt Topol
>            Priority: Major
>              Labels: pull-request-available
>
> Discovered this through using pyarrow and dealing with RecordBatch Streams and parquet. The issue can be replicated as follows:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # create record batch with 1 dictionary column
> indices = pa.array([1,0,1,1,0])
> dictionary = pa.array(['Foo', 'Bar'])
> dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
> rb = pa.RecordBatch.from_arrays([dict_array], ['d0'])
> # write out using RecordBatchStreamWriter
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, rb.schema)
> writer.write_batch(rb)
> writer.close()
> buf = sink.get_result()
> # read in and try to write parquet table
> reader = pa.open_stream(buf)
> tbl = reader.read_all()
> pq.write_table(tbl, 'dict_table.parquet') # SEGFAULTS
> {code}
> When writing record batch streams, if an array contains no nulls, Arrow writes a placeholder nullptr in place of a full bitmap of 1s. When that stream is deserialized, the null bitmap is therefore never populated and stays a nullptr. When attempting to write such a table via pyarrow.parquet, you end up [here|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L963] in the parquet writer code, which attempts to cast the dictionary column to a non-dictionary representation. Since the null count isn't checked before creating a BitmapReader, the BitmapReader is constructed with a nullptr for bitmap_data but a non-zero length, and it segfaults in the constructor [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit-util.h#L415] because the bitmap is null.
> So a simple check of the null count before constructing the BitmapReader avoids the segfault.
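> A minimal C++ sketch of that guard (illustrative names only, not the actual parquet-cpp code; BitmapReader below is a simplified stand-in for arrow::internal::BitmapReader):
> {code:cpp}
> #include <cstdint>
>
> // Simplified stand-in for arrow::internal::BitmapReader. It assumes the
> // bitmap pointer is valid whenever length > 0, which is why handing it the
> // nullptr placeholder written for all-valid arrays crashes.
> class BitmapReader {
>  public:
>   BitmapReader(const uint8_t* bitmap, int64_t start_offset, int64_t length)
>       : bitmap_(bitmap), length_(length) {
>     if (length > 0) {
>       // Dereferences the bitmap immediately -> segfault when bitmap == nullptr.
>       current_byte_ = bitmap_[start_offset / 8];
>     }
>   }
>
>  private:
>   const uint8_t* bitmap_;
>   uint8_t current_byte_ = 0;
>   int64_t length_;
> };
>
> // Sketch of the fix: only walk the validity bitmap when there actually are
> // nulls; with a null count of zero the bitmap may legitimately be nullptr.
> void UnpackIndices(const uint8_t* valid_bits, int64_t null_count, int64_t length) {
>   if (null_count > 0) {
>     BitmapReader reader(valid_bits, /*start_offset=*/0, length);
>     // ... consult the reader bit by bit and skip null slots ...
>   } else {
>     // All values valid: copy every slot without touching valid_bits.
>   }
> }
>
> int main() {
>   // Safe with the guard: nullptr bitmap, zero nulls.
>   UnpackIndices(nullptr, /*null_count=*/0, /*length=*/5);
>   return 0;
> }
> {code}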
> Already filed [PR 1896|https://github.com/apache/arrow/pull/1896]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)