Posted to jira@arrow.apache.org by "&res (Jira)" <ji...@apache.org> on 2021/07/30 14:36:00 UTC
[jira] [Created] (ARROW-13509) Cannot "explode" empty table
&res created ARROW-13509:
----------------------------
Summary: Cannot "explode" empty table
Key: ARROW-13509
URL: https://issues.apache.org/jira/browse/ARROW-13509
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Affects Versions: 4.0.0
Reporter: &res
I'm trying to explode a table (in the pandas sense: [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html]).
As it's not yet supported, I've written some code to do it using a mix of list_flatten and list_parent_indices. It works well, except that it crashes on empty tables:
{code:python}
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0730 15:16:05.164858 13612 chunked_array.cc:48] Check failed: (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
*** Check failure stack trace: ***
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
{code}
Here's a reproducible example:
{code:python}
import sys
import pyarrow as pa
from pyarrow import compute
import pandas as pd

# Three keys, each paired with a list of strings of increasing length.
table = pa.Table.from_arrays(
    [
        pa.array([101, 102, 103], pa.int32()),
        pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
    ],
    names=['key', 'list']
)

def explode(table) -> pa.Table:
    # Flatten the list column and repeat each key once per list element.
    exploded_list = compute.list_flatten(table['list'])
    indices = compute.list_parent_indices(table['list'])
    assert indices.type == pa.int32()
    keys = compute.take(table['key'], indices)  # <--- Crashes here
    return pa.Table.from_arrays(
        [keys, exploded_list],
        names=['key', 'list_element']
    )

explode(table).to_pandas().to_markdown(sys.stdout)
explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout)  # <--- doesn't work
{code}
I've narrowed it down to the following:
When list_parent_indices is called on an empty table, it returns this empty chunked array (no chunks at all):
{code}
pa.chunked_array([], pa.int32())
{code}
Instead of this chunked array with 1 empty chunk:
{code}
pa.chunked_array([pa.array([], pa.int32())])
{code}
In turn, take doesn't work with the empty chunked array:
{code:python}
compute.take(pa.chunked_array([pa.array([], pa.int32())]),
             pa.chunked_array([], pa.int32()))  # Bad
compute.take(pa.chunked_array([pa.array([], pa.int32())]),
             pa.chunked_array([pa.array([], pa.int32())]))  # Good
{code}
Now, in terms of how to fix it, there are two possible solutions:
* take could accept empty chunked array
* list_parent_indices could return a chunked array with an empty chunk
PS: the error message isn't accurate. It says "cannot construct ChunkedArray from empty vector and omitted type", but the chunked array being passed does have a type (int32); it just has no chunks. This makes me suspect that something in take strips the type of the empty chunked array.