Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/06/13 18:59:00 UTC
[jira] [Commented] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

    [ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511550#comment-16511550 ] 

ASF GitHub Bot commented on ARROW-1644:
---------------------------------------

wesm commented on issue #462: ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow
URL: https://github.com/apache/parquet-cpp/pull/462#issuecomment-397048860
 
 
   I have been on the road a lot lately, but I hope to spend some time reviewing this in the next 10 days. I'm sorry for the holdup.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [Python] Read and write nested Parquet data with a mix of struct and list nesting levels
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-1644
>                 URL: https://issues.apache.org/jira/browse/ARROW-1644
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: DB Tsai
>            Assignee: Joshua Storck
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> We have many nested Parquet files generated by Apache Spark for ranking problems, and we would like to load them in Python for other programs to consume.
> The schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> When I tried to load it with a nightly build of pyarrow on Oct 4, 2017, I got the following error:
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-00000')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I had the impression that once https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able to load nested Parquet files in pyarrow.
> Any insight into this?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)