Posted to issues@arrow.apache.org by "Benjamin Kietzman (Jira)" <ji...@apache.org> on 2019/09/17 12:10:00 UTC

[jira] [Resolved] (ARROW-5630) [Python][Parquet] Table of nested arrays doesn't round trip

     [ https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Kietzman resolved ARROW-5630.
--------------------------------------
    Resolution: Fixed

Issue resolved by pull request 5395
[https://github.com/apache/arrow/pull/5395]

> [Python][Parquet] Table of nested arrays doesn't round trip
> -----------------------------------------------------------
>
>                 Key: ARROW-5630
>                 URL: https://issues.apache.org/jira/browse/ARROW-5630
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow 0.13, Windows 10
>            Reporter: Philip Felton
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def make_table(num_rows):
>     typ = pa.list_(pa.field("item", pa.float32(), False))
>     return pa.Table.from_arrays([
>         pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
>         pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
>     ], ['a', 'b'])
> pq.write_table(make_table(1000000), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-4-0f3266afa36c> in <module>
> ----> 1 pq.read_table('full.parquet')
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153 
>    1154     pf = ParquetFile(source, metadata=metadata)
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     179                                  filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
> --> 181                             use_pandas_metadata=use_pandas_metadata)
>     182 
>     183     def open(self, path, mode='rb'):
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
>    1012             table = piece.read(columns=columns, use_threads=use_threads,
>    1013                                partitions=self.partitions,
> -> 1014                                use_pandas_metadata=use_pandas_metadata)
>    1015             tables.append(table)
>    1016 
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, partitions, open_file_func, file, use_pandas_metadata)
>     562             table = reader.read_row_group(self.row_group, **options)
>     563         else:
> --> 564             table = reader.read(**options)
>     565 
>     566         if len(self.partition_keys) > 0:
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
>     212             columns, use_pandas_metadata=use_pandas_metadata)
>     213         return self.reader.read_all(column_indices=column_indices,
> --> 214                                     use_threads=use_threads)
>     215 
>     216     def scan_contents(self, columns=None, batch_size=65536):
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}
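> For reference, a minimal round-trip check (a sketch added for illustration, not part of the original report; it assumes pyarrow 0.15.0 or later, where this is fixed):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> # Same nested-list schema as the repro above: lists of non-nullable float32 items.
> typ = pa.list_(pa.field("item", pa.float32(), False))
> table = pa.Table.from_arrays([
>     pa.array([[0] * (i % 10) for i in range(1000000)], type=typ),
>     pa.array([[0] * ((i + 5) % 10) for i in range(1000000)], type=typ),
> ], ['a', 'b'])
> 
> pq.write_table(table, 'roundtrip.parquet')
> restored = pq.read_table('roundtrip.parquet')
> 
> # With the fix, read_table no longer raises ArrowInvalid and both columns
> # come back with the full row count.
> assert restored.num_rows == table.num_rows
> assert len(restored.column('a')) == table.num_rows
> assert len(restored.column('b')) == table.num_rows
> {code}
> A full table equality check is left out of the sketch, since list field names and nullability metadata may differ after a Parquet round trip.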



--
This message was sent by Atlassian Jira
(v8.3.2#803003)