Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/30 16:08:00 UTC

[jira] [Comment Edited] (ARROW-10140) No data for map column of a parquet file created from pyarrow and pandas

    [ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204840#comment-17204840 ] 

Joris Van den Bossche edited comment on ARROW-10140 at 9/30/20, 4:07 PM:
-------------------------------------------------------------------------

Yes, that was with latest master. 

On 1.0, I get the expected NotImplemented error:

{code}
In [2]: pq.read_table("notebooks-arrow/test_map.parquet")
...
ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
{code}

Now, writing with pyarrow 1.0 still works without error (and reading with pyarrow 1.0 also gives the NotImplemented error).  
But reading that file with arrow master gives a different error message:

{code}
In [45]: pq.read_table("test_map.parquet")
...
ArrowInvalid: Struct child array #0 invalid: Invalid: Length spanned by list offsets (2) larger than values array (length 0)
In ../src/parquet/arrow/reader.cc, line 106, code: (*out)->chunk(x)->Validate()
In ../src/parquet/arrow/reader.cc, line 881, code: ::arrow::internal::OptionalParallelFor( reader_properties_.use_threads(), static_cast<int>(readers.size()), [&](int i) { return readers[i]->NextBatch(batch_size, &columns[i]); })
In ../src/arrow/util/iterator.h, line 385, code: (_error_or_value7).status()
In ../src/arrow/record_batch.h, line 208, code: ReadNext(&batch)
In ../src/arrow/util/iterator.h, line 283, code: (_error_or_value5).status()
In ../src/arrow/util/iterator.h, line 283, code: (_error_or_value5).status()
In ../src/arrow/util/iterator.h, line 129, code: value_.status()
In ../src/arrow/util/iterator.h, line 157, code: (_error_or_value4).status()
In ../src/arrow/dataset/scanner.cc, line 210, code: (_error_or_value18).status()
In ../src/arrow/dataset/scanner.cc, line 217, code: task_group->Finish()
{code}

indicating that the file written by pyarrow 1.0 might indeed be incorrect / corrupt (the original report). However, it also seems to work on pyarrow master (at least _we_ can read it back in), though it would also be good to verify that the java parquet-tools can now read it correctly.


> No data for map column of a parquet file created from pyarrow and pandas
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: test_map.parquet, test_map.py
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by pyarrow.
> I followed [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries] to convert a pandas DF to an arrow table, then call write_table to output a parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>          'col1': pd.Series([
>              [('id', 'something'), ('value2', 'else')],
>              [('id', 'something2'), ('value','else2')],
>          ]),
>          'col2': pd.Series(['foo', 'bar'])
>      })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my development machine:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> It generated the test_map.parquet file (attached as test_map.parquet) successfully.
> Then I used parquet-tools (1.11.1) to read the file, but got the following output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong?
> We need to write the data to parquet files and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)