Posted to dev@arrow.apache.org by "Benoit Rostykus (Jira)" <ji...@apache.org> on 2019/10/10 04:45:00 UTC

[jira] [Created] (ARROW-6844) List columns read broken with 0.15.0

Benoit Rostykus created ARROW-6844:
--------------------------------------

             Summary: List<scalar type> columns read broken with 0.15.0
                 Key: ARROW-6844
                 URL: https://issues.apache.org/jira/browse/ARROW-6844
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.15.0
            Reporter: Benoit Rostykus


Columns of type `array<primitive type>` (such as `array<int32>`, `array<int64>`, ...) can no longer be read with `pyarrow == 0.15.0` (they could be with `pyarrow == 0.14.1`) when the parquet file was originally written by `parquet-mr 1.9.1`.

```
import pyarrow.parquet as pq

# 'sample.gz.parquet' was written by parquet-mr 1.9.1
pf = pq.ParquetFile('sample.gz.parquet')

# 'profile_ids' is an array<int64> column
print(pf.read(columns=['profile_ids']))
```
Output with 0.14.1:
```
pyarrow.Table
profile_ids: list<element: int64>
 child 0, element: int64

...
```
Output with 0.15.0:

```

Traceback (most recent call last):
 File "<string>", line 1, in <module>
 File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1131, in pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int64> is inconsistent with schema list<element: int64>

```

I've tested parquet files coming from multiple tables (with various schemas) created with `parquet-mr`, and I couldn't read any `array<primitive type>` column in any of them.

I _think_ the bug was introduced with [this commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5].

I think the root of the issue is that `parquet-mr` writes the inner struct name as `"element"` by default (see [here|https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]), whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example [this test|https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]). Round-trip tests that both write and read with pyarrow obviously won't catch this.
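
To make the suspected mismatch concrete, here is a toy model in plain Python. It is not pyarrow or parquet-cpp code: `ListType`, `strict_equal` and `equal_ignoring_child_name` are hypothetical names made up for illustration. It only shows how a comparison that includes the child field name rejects `list<item: int64>` against `list<element: int64>`, while a comparison on the value type alone accepts it. Whether the reader actually compares types this way is my assumption, based on the error message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListType:
    """Toy stand-in for a list type: a named child field plus its value type."""
    child_name: str
    value_type: str

    def __str__(self) -> str:
        return f"list<{self.child_name}: {self.value_type}>"


def strict_equal(a: ListType, b: ListType) -> bool:
    # Compares the child field name as well as the value type.
    return a == b


def equal_ignoring_child_name(a: ListType, b: ListType) -> bool:
    # Compares only the value type; the child field name is treated as cosmetic.
    return a.value_type == b.value_type


# Child name that parquet-mr writes by default (per the report above).
file_schema = ListType("element", "int64")
# Child name that the Arrow side appears to assume (per the report above).
column_data = ListType("item", "int64")

print(f"{column_data} vs {file_schema}")
print("strict:", strict_equal(column_data, file_schema))
print("ignoring child name:", equal_ignoring_child_name(column_data, file_schema))
```

If the strict comparison is what the 0.15.0 read path does, that would explain why files written and read by pyarrow alone are unaffected: both sides then use the same child name.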
