Posted to jira@arrow.apache.org by "&res (Jira)" <ji...@apache.org> on 2021/07/30 14:36:00 UTC
[jira] [Created] (ARROW-13509) Cannot "explode" empty table

&res created ARROW-13509:
----------------------------

             Summary: Cannot "explode" empty table
                 Key: ARROW-13509
                 URL: https://issues.apache.org/jira/browse/ARROW-13509
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
    Affects Versions: 4.0.0
            Reporter: &res


I'm trying to explode a table (in the pandas sense: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html).

As it's not yet supported, I've written some code to do it using a mix of list_flatten and list_parent_indices. It works well, except that it crashes on empty tables:
{code:python}
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0730 15:16:05.164858 13612 chunked_array.cc:48]  Check failed: (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
*** Check failure stack trace: ***
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

{code}
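
For readers unfamiliar with the two kernels, here is a minimal pure-Python sketch of what the combination is meant to compute. It does not call pyarrow; the function names flatten_with_parent_indices and explode_rows are made up for illustration and only mirror the intended semantics of list_flatten (all list elements, in order) and list_parent_indices (the row each element came from).

```python
from typing import List, Sequence, Tuple


def flatten_with_parent_indices(
    lists: Sequence[Sequence[str]],
) -> Tuple[List[str], List[int]]:
    """Illustrative stand-in for list_flatten + list_parent_indices."""
    values: List[str] = []
    parents: List[int] = []
    for row, items in enumerate(lists):
        for item in items:
            values.append(item)   # what list_flatten would emit
            parents.append(row)   # what list_parent_indices would emit
    return values, parents


def explode_rows(
    keys: Sequence[int], lists: Sequence[Sequence[str]]
) -> List[Tuple[int, str]]:
    """Repeat each key once per element of its list (the "take" step)."""
    values, parents = flatten_with_parent_indices(lists)
    return [(keys[parent], value) for parent, value in zip(parents, values)]


rows = explode_rows([101, 102, 103], [['a'], ['a', 'b'], ['a', 'b', 'c']])
empty_rows = explode_rows([], [])
```

In this sketch an empty input simply yields an empty result, which is the behaviour I would expect from the pyarrow version as well.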

Here's a reproducible example:

{code:python}

import sys

import pyarrow as pa
from pyarrow import compute
import pandas as pd

table = pa.Table.from_arrays(
    [
        pa.array([101, 102, 103], pa.int32()),
        pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
    ],
    names=['key', 'list']
)


def explode(table: pa.Table) -> pa.Table:
    exploded_list = compute.list_flatten(table['list'])

    indices = compute.list_parent_indices(table['list'])
    assert indices.type == pa.int32()
    keys = compute.take(table['key'], indices)  # <--- Crashes here
    return pa.Table.from_arrays(
        [keys, exploded_list],
        names=['key', 'list_element']
    )


explode(table).to_pandas().to_markdown(sys.stdout)
explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout) # <--- doesn't work
{code}
 
I've narrowed it down to the following: 

When list_parent_indices is called on the list column of an empty table, it returns this empty chunked array (zero chunks):
{code}
pa.chunked_array([], pa.int32())
{code}
instead of this chunked array with one empty chunk:
{code}
pa.chunked_array([pa.array([], pa.int32())])
{code}

In turn, take doesn't work with the empty chunked array:
{code:python}
compute.take(pa.chunked_array([pa.array([], pa.int32())]),
             pa.chunked_array([], pa.int32())) # Bad
compute.take(pa.chunked_array([pa.array([], pa.int32())]),
             pa.chunked_array([pa.array([], pa.int32())])) # Good
{code}


As for how to fix it, there are two possible solutions:
* take could accept an empty chunked array
* list_parent_indices could return a chunked array with one empty chunk

PS: the error message isn't accurate. It says "cannot construct ChunkedArray from empty vector and omitted type", but the array being passed does have a type (int32); it just has no chunks. This makes me suspect that something in take strips the type of the empty chunked array.






 


