Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2020/09/30 15:23:00 UTC
[jira] [Assigned] (ARROW-10140) No data for map column of a parquet file created from pyarrow and pandas

     [ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield reassigned ARROW-10140:
---------------------------------------

    Assignee: Micah Kornfield

> No data for map column of a parquet file created from pyarrow and pandas
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: test_map.parquet, test_map.py
>
>
> Hi,
> I'm having trouble reading Parquet files that contain a 'map' column created by pyarrow.
> I followed [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries] (and also referred to https://issues.apache.org/jira/browse/ARROW-9812) to convert a pandas DataFrame to an Arrow table, then called write_table to write a Parquet file:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> # Each cell of col1 is a list of (key, value) tuples, one list per row.
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value', 'else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar']),
> })
> # Declare col1 as map<string, string> so the tuples are converted to map entries.
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema=schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs without errors on my development machine:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> It generated the test_map.parquet file (also attached) successfully.
> I then used parquet-tools (1.11.1) to read the file, but the key_value entries of the map column come back empty:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong?
> We need to write the data to Parquet files and query them later.
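
For reference, each of the two rows in col1 of the quoted script holds two key/value pairs, so each `.key_value:` line in the parquet-tools output would be expected to carry a key and a value. The sketch below is plain Python with no pyarrow dependency. It flattens the same rows into the offsets/keys/values form used by a list-of-key-value-entries layout, which is how the repeated key_value group in the schema is organised. The helper name `flatten_map_column` is hypothetical and exists only for illustration; it is not part of pyarrow or parquet-tools, and it treats a missing row as an empty map rather than tracking nulls separately.

```python
from typing import List, Optional, Sequence, Tuple

Entry = Tuple[str, Optional[str]]


def flatten_map_column(
    rows: Sequence[Optional[Sequence[Entry]]],
) -> Tuple[List[int], List[str], List[Optional[str]]]:
    """Flatten rows of (key, value) pairs into offsets, keys and values.

    Hypothetical helper for illustration only. offsets[i]..offsets[i + 1]
    delimits the entries that belong to row i.
    """
    offsets = [0]
    keys: List[str] = []
    values: List[Optional[str]] = []
    for row in rows:
        # Simplification: a None row is treated as an empty map here.
        for key, value in (row or []):
            if key is None:
                # The schema declares key as required, so null keys are rejected.
                raise ValueError("map keys must not be null")
            keys.append(key)
            values.append(value)
        offsets.append(len(keys))
    return offsets, keys, values


# Same data as col1 in the quoted script.
rows = [
    [("id", "something"), ("value2", "else")],
    [("id", "something2"), ("value", "else2")],
]
offsets, keys, values = flatten_map_column(rows)
print(offsets)  # [0, 2, 4]
print(keys)     # ['id', 'value2', 'id', 'value']
print(values)   # ['something', 'else', 'something2', 'else2']
```

If the file held this data as intended, a reader would be expected to recover two entries per row; the empty `.key_value:` lines above show neither keys nor values.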



--
This message was sent by Atlassian Jira
(v8.3.4#803005)