Posted to jira@arrow.apache.org by "Chen Ming (Jira)" <ji...@apache.org> on 2020/10/01 15:21:00 UTC

[jira] [Comment Edited] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

    [ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205599#comment-17205599 ] 

Chen Ming edited comment on ARROW-10140 at 10/1/20, 3:20 PM:
-------------------------------------------------------------

[~emkornfield] [~jorisvandenbossche] Thank you for the quick follow-up, and sorry for not explaining the problem clearly...

After we generate the parquet file(s), we upload them to AWS S3, then use Amazon Athena to create a table and query them.

Take "test_map.parquet" as an example:
 # Upload the file to an S3 bucket (e.g. s3://test/test_map).
 # Use the DDL below to create a table:
{code:java}
CREATE EXTERNAL TABLE `development.test_map`(
  `col1` map<string,string>,
  `col2` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS PARQUET 
LOCATION
  's3://test/test_map'
TBLPROPERTIES (
  'has_encrypted_data'='true')
{code}
# Refresh the table, then query all records:
{code:java}
MSCK REPAIR TABLE development.test_map
select * from  development.test_map
{code}
We got the following output:
||col1||col2||
|{}|foo|
|{}|bar|
# Also try to get the value for a single key:
{code:java}
select col1['id'] as id from  development.test_map
{code}
We got the following output:
||id||
|[NULL]|
|[NULL]|

 

We tested a parquet file (attached as pyspark.snappy.parquet) created from PySpark, which can be queried successfully from Amazon Athena:
||sid||mp||
|1|{bar=2, foo=1, baz=aaa}|
|2|{bar=2, foo=1, baz=aaa}|
|3|{bar=2, foo=1, baz=aaa}|

 (The content of the data is different in the Spark version.)

 

We are using AWS Lambda to convert raw data files into Parquet files, and Spark does not seem to be a good fit for AWS Lambda.
 We were really happy to find that Arrow added support for MapType in 0.17.0, and we would like to use it in our project.



> [Python][C++] No data for map column of a parquet file created from pyarrow and pandas
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: test_map.parquet, test_map.py
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by pyarrow.
> I followed [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries] to convert a pandas DF to an arrow table, then call write_table to output a parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>          'col1': pd.Series([
>              [('id', 'something'), ('value2', 'else')],
>              [('id', 'something2'), ('value','else2')],
>          ]),
>          'col2': pd.Series(['foo', 'bar'])
>      })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my development machine:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) successfully.
> Then I used parquet-tools (1.11.1) to read the file, but got the following output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)