Posted to issues@arrow.apache.org by "Kyle Barron (JIRA)" <ji...@apache.org> on 2018/03/30 21:58:00 UTC
[jira] [Created] (ARROW-2372) ArrowIOError: Invalid argument

Kyle Barron created ARROW-2372:
----------------------------------

             Summary: ArrowIOError: Invalid argument
                 Key: ARROW-2372
                 URL: https://issues.apache.org/jira/browse/ARROW-2372
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0, 0.8.0
         Environment: Ubuntu 16.04
            Reporter: Kyle Barron


I get an ArrowIOError when reading a specific file that was also written by pyarrow. Specifically, the traceback is:
{code:python}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
 ---------------------------------------------------------------------------
 ArrowIOError Traceback (most recent call last)
 <ipython-input-18-149f11bf68a5> in <module>()
 ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
 62 self.reader = ParquetReader()
 63 source = _ensure_file(source)
 ---> 64 self.reader.open(source, metadata=metadata)
 65 self.common_metadata = common_metadata
 66 self._nested_paths_by_prefix = self._build_nested_paths()
_parquet.pyx in pyarrow._parquet.ParquetReader.open()
error.pxi in pyarrow.lib.check_status()
ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}
Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB CSV file to Parquet in chunks of roughly 2 GB each. To get the source data:
{code:bash}
wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
unzip gaz2016zcta5distancemiles.csv.zip{code}
The approach, following the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing], is to instantiate the writer class, loop over chunks of the CSV and write each one to Parquet, then close the writer object.

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

zcta_file = Path('gaz2016zcta5distancemiles.csv')
itr = pd.read_csv(
    zcta_file,
    header=0,
    dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
    engine='c',
    chunksize=64617153)

schema = pa.schema([
    pa.field('zip1', pa.string()),
    pa.field('zip2', pa.string()),
    pa.field('mi_to_zcta5', pa.float64())])

writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
print('Starting conversion')

i = 0
for df in itr:
    i += 1
    print(f'Finished reading csv block {i}')

    table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
    writer.write_table(table)

    print(f'Finished writing parquet block {i}')

writer.close()
{code}
Running this Python script produces the file `gaz2016zcta5distancemiles.parquet`, but just attempting to read the metadata with `pq.ParquetFile()` produces the above exception.
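
To help narrow down whether the file on disk is structurally complete, its framing can be inspected without pyarrow. The following is only a minimal sketch, assuming the standard Parquet layout (a 4-byte `PAR1` magic at the start and end of the file, with a 4-byte little-endian footer length immediately before the trailing magic). The helper name `inspect_parquet_footer` is hypothetical and not part of pyarrow.

```python
import os
import struct

PARQUET_MAGIC = b'PAR1'
# Leading magic (4) + footer length (4) + trailing magic (4).
MIN_FRAMING_BYTES = 12


def inspect_parquet_footer(path):
    """Report basic framing facts about a Parquet file (hypothetical helper).

    Assumes the standard layout: b'PAR1' at both ends and a 4-byte
    little-endian footer length just before the trailing magic.
    """
    size = os.path.getsize(path)
    if size < MIN_FRAMING_BYTES:
        raise ValueError(f'{path} is too small to be a Parquet file ({size} bytes)')

    with open(path, 'rb') as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)      # last 8 bytes: footer length + magic
        footer_len_bytes = f.read(4)
        tail = f.read(4)

    (footer_len,) = struct.unpack('<I', footer_len_bytes)
    return {
        'file_size': size,
        'head_magic_ok': head == PARQUET_MAGIC,
        'tail_magic_ok': tail == PARQUET_MAGIC,
        'footer_length': footer_len,
        # The footer must fit between the leading magic and the length field.
        'footer_fits': footer_len <= size - MIN_FRAMING_BYTES,
    }
```

If both magic checks pass and `footer_fits` is true, the framing looks intact, which would suggest the problem lies in how the reader opens or reads the file rather than in a truncated write.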

I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would complain when reading the CSV if the columns in the data were not `string`, `string`, and `float64`, so I think creating the Parquet schema this way should be fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)