Posted to issues@arrow.apache.org by "Rinke Hoekstra (Jira)" <ji...@apache.org> on 2019/11/21 14:25:00 UTC
[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write
nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979315#comment-16979315 ]
Rinke Hoekstra edited comment on ARROW-1644 at 11/21/19 2:24 PM:
-----------------------------------------------------------------
I was just trying this with the example found in the pyarrow docs at [http://arrow.apache.org/docs/python/json.html]
The documented example does not work. Is this related to this issue, or is it another matter?
It says to load the following JSON file:
    {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
    {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
I fixed this to make it valid (but that's another issue):
    [{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},
     {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]
Then reading the JSON from a file called `my_data.json`:
    from pyarrow import json
    table = json.read_json("my_data.json")
Gives the following error:
    ---------------------------------------------------------------------------
    ArrowInvalid Traceback (most recent call last)
    <ipython-input-69-f974c21f0941> in <module>()
          1 from pyarrow import json
    ----> 2 table = json.read_json('test.json')

    ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

    ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

    ArrowInvalid: JSON parse error: A column changed from object to array
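The error text ("A column changed from object to array") suggests that the reader treats its input as a sequence of JSON objects, one per line, which is the layout of the documented example before the array wrapper was added. That reading is an assumption and is not confirmed in this thread. Under that assumption, a minimal sketch of turning the array form back into line-delimited form with only the Python standard library could look like this; `array_to_json_lines` is a hypothetical helper and not part of pyarrow:

```python
import json
from pathlib import Path


def array_to_json_lines(src: str, dst: str) -> int:
    """Rewrite a file holding one top-level JSON array as one object per line.

    Hypothetical helper for illustration; it is not part of pyarrow.
    Returns the number of records written.
    """
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps emits no newlines by default, so each record
            # ends up on exactly one line.
            out.write(json.dumps(record) + "\n")
    return len(records)
```

The output file could then be passed to `read_json` in place of the array version. Note that `from pyarrow import json` shadows the standard-library `json` module used here if both are imported under the same name in one session.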
was (Author: rinkehoekstra):
I was just trying this with the example found in the pyarrow docs at [http://arrow.apache.org/docs/python/json.html]
The documented example does not work. Is this related to this issue, or is it another matter?
It says to load the following JSON file:
```
{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
```
I fixed this to make it valid (but that's another issue):
```
[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]
```
Then reading the JSON from a file called `my_data.json`:
```
from pyarrow import json
table = json.read_json("my_data.json")
```
Gives the following error:
```
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-69-f974c21f0941> in <module>()
1 from pyarrow import json
----> 2 table = json.read_json('test.json')
~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()
~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: JSON parse error: A column changed from object to array
```
> [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Micah Kornfield
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested Parquet files generated by Apache Spark for ranking problems, and we would like to load them in Python for other programs to consume.
> The schema looks like this:
> {code:java}
> root
> |-- profile_id: long (nullable = true)
> |-- country_iso_code: string (nullable = true)
> |-- items: array (nullable = false)
> | |-- element: struct (containsNull = false)
> | | |-- show_title_id: integer (nullable = true)
> | | |-- duration: double (nullable = true)
> {code}
> When I tried to load one with a nightly build of pyarrow from Oct 4, 2017, I got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-00000')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
> nthreads=nthreads)
> File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
> File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I have the impression that once https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able to load nested Parquet data in pyarrow.
> Any insight about this?
> Thanks.