Posted to jira@arrow.apache.org by "Chen Ming (Jira)" <ji...@apache.org> on 2020/09/30 08:21:00 UTC

[jira] [Created] (ARROW-10140) No data for map column of a parquet file created from pyarrow and pandas

Chen Ming created ARROW-10140:
---------------------------------

             Summary: No data for map column of a parquet file created from pyarrow and pandas
                 Key: ARROW-10140
                 URL: https://issues.apache.org/jira/browse/ARROW-10140
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: Chen Ming
         Attachments: test_map.py

Hi,

I'm having problems reading parquet files with 'map' data type created by pyarrow.

I followed [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries] to convert a pandas DataFrame to an Arrow table, then called write_table to write a Parquet file:

(We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
{code:python}
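import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Print the library versions used for this reproduction.
print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

# col1 holds one list of (key, value) tuples per row; col2 is a plain string column.
df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value2', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

# Declare col1 as map<string, string> so that it is written as a Parquet MAP.
udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

# Convert the DataFrame using the explicit schema and write the Parquet file.
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, './test_map.parquet')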
{code}
The above code (attached as test_map.py) runs without errors on my development machine:
{code:java}
PyArrow Version = 1.0.1
Pandas Version = 1.1.2
{code}
It also generated the test_map.parquet file (attached as test_map.parquet) successfully.

Then I used parquet-tools (1.11.1) to read the file, but got the following output, in which the key_value entries of col1 are empty:
{code:java}
$ java -jar parquet-tools-1.11.1.jar head test_map.parquet
col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar
{code}
I also checked the schema of the parquet file:
{code:java}
$ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
  optional binary col2 (STRING);
}
{code}
Am I doing something wrong? 

We need to write the data as Parquet files and query them later.


