Posted to issues@arrow.apache.org by "Adam Hooper (Jira)" <ji...@apache.org> on 2019/09/15 19:05:00 UTC

[jira] [Commented] (ARROW-6568) pyarrow.parquet crash writing zero-chunk dictionary-type column

    [ https://issues.apache.org/jira/browse/ARROW-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930052#comment-16930052 ] 

Adam Hooper commented on ARROW-6568:
------------------------------------

My workaround, in my function that wraps `pyarrow.parquet.write_table()`:

{code:python}
import pyarrow
import pyarrow.parquet

# `table` is the pyarrow.Table about to be written.
if table.num_rows == 0:
    # Workaround for https://issues.apache.org/jira/browse/ARROW-6568
    # If table is zero-length, guarantee it has a RecordBatch so Arrow
    # won't crash when writing a DictionaryArray.
    def empty_array_for_field(field):
        if pyarrow.types.is_dictionary(field.type):
            return pyarrow.DictionaryArray.from_arrays(
                pyarrow.array([], type=field.type.index_type),
                pyarrow.array([], type=field.type.value_type),
            )
        else:
            return pyarrow.array([], type=field.type)
    table = pyarrow.table(
        {field.name: empty_array_for_field(field) for field in table.schema}
    )

# ... and now `table` is safe to use in `pyarrow.parquet.write_table()`.
{code}

> pyarrow.parquet crash writing zero-chunk dictionary-type column
> ---------------------------------------------------------------
>
>                 Key: ARROW-6568
>                 URL: https://issues.apache.org/jira/browse/ARROW-6568
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: Pyarrow v0.14.1, manylinux1
>            Reporter: Adam Hooper
>            Priority: Major
>
> Trying to write a zero-RecordBatch table to Parquet:
> {code:python}
> import pyarrow
> import pyarrow.parquet
> table = pyarrow.Table.from_batches([], pyarrow.schema([('A', pyarrow.dictionary(pyarrow.int32(), pyarrow.string()))]))
> pyarrow.parquet.write_table(table, 'x.parquet')
> {code}
> ... I receive an error and Python exits with exit code {{139}}:
> {noformat}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0915 18:37:23.099939     1 table.cc:64]  Check failed: (chunks.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
> *** Check failure stack trace: ***
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)