Posted to issues@arrow.apache.org by "Kyle Barron (JIRA)" <ji...@apache.org> on 2018/04/02 14:45:00 UTC

[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument

    [ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422572#comment-16422572 ] 

Kyle Barron commented on ARROW-2372:
------------------------------------

I edited my code into the script below, which, I believe, writes a Parquet file from just the first 2GB csv chunk, then from the first two chunks, and so on, checking each time that the output can be reopened. Here's the traceback first; it suggests pyarrow could open the Parquet file representing around 6GB of csv data, but not the one representing about 8GB.
{code}
Starting conversion, up to iteration 0
	0.12 minutes
Finished reading csv block 0
	0.43 minutes
Finished writing parquet block 0
	1.80 minutes
Starting conversion, up to iteration 1
	1.80 minutes
Finished reading csv block 0
	2.12 minutes
Finished writing parquet block 0
	3.49 minutes
Finished reading csv block 1
	3.80 minutes
Finished writing parquet block 1
	5.19 minutes
Starting conversion, up to iteration 2
	5.20 minutes
Finished reading csv block 0
	5.52 minutes
Finished writing parquet block 0
	6.91 minutes
Finished reading csv block 1
	7.22 minutes
Finished writing parquet block 1
	8.59 minutes
Finished reading csv block 2
	8.92 minutes
Finished writing parquet block 2
	10.29 minutes
Starting conversion, up to iteration 3
	10.29 minutes
Finished reading csv block 0
	10.60 minutes
Finished writing parquet block 0
	11.98 minutes
Finished reading csv block 1
	12.30 minutes
Finished writing parquet block 1
	13.66 minutes
Finished reading csv block 2
	13.98 minutes
Finished writing parquet block 2
	15.35 minutes
Finished reading csv block 3
	15.68 minutes
Finished writing parquet block 3
	17.05 minutes
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-10-2fadd2a47023> in <module>()
     29         if j == i:
     30             writer.close()
---> 31             pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
     32             pfs_dict[i] = pf
     33             break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
     62         self.reader = ParquetReader()
     63         source = _ensure_file(source)
---> 64         self.reader.open(source, metadata=metadata)
     65         self.common_metadata = common_metadata
     66         self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}
And the source code:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()

zcta_file = Path('gaz2016zcta5distancemiles.csv')

pfs_dict = {}

for i in range(17):  # 17 chunks of ~2GB cover the full 34GB csv
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            # Open the writer on the first chunk, using the schema inferred
            # from that chunk's DataFrame
            writer = pq.ParquetWriter(f'gaz2016zcta5distancemiles_{i}.parquet', schema=table.schema)

        writer.write_table(table)

        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        if j == i:
            # After writing chunks 0..i, close the file and try to reopen it
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break

        j += 1
{code}
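To check whether the failure tracks a size threshold, it may help to look at the on-disk size of each per-iteration output. A minimal sketch, assuming the gaz2016zcta5distancemiles_{i}.parquet files written by the script above are still on disk:
{code:python}
# Hypothetical diagnostic, not part of the original script: report the
# size of each per-iteration Parquet file and whether it can be reopened.
import os
import pyarrow.parquet as pq

for i in range(4):
    path = f'gaz2016zcta5distancemiles_{i}.parquet'
    if not os.path.exists(path):
        continue
    size_gib = os.path.getsize(path) / 2**30
    try:
        pq.ParquetFile(path)
        status = 'opens'
    except Exception as exc:
        status = f'fails: {exc}'
    print(f'{path}: {size_gib:.2f} GiB, {status}')
{code}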

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>
> I get an ArrowIOError when reading a specific file that was also written by pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> <ipython-input-18-149f11bf68a5> in <module>()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>      62         self.reader = ParquetReader()
>      63         source = _ensure_file(source)
> ---> 64         self.reader.open(source, metadata=metadata)
>      65         self.common_metadata = common_metadata
>      66         self._nested_paths_by_prefix = self._build_nested_paths()
>
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
>
> error.pxi in pyarrow.lib.check_status()
>
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
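> (On the chunk size: the 64617153 rows per chunk used below were previously measured to be about 2GB of csv; a rough way to rederive such a figure is to sample the head of the file. A hypothetical sketch, not part of the original report:)
> {code:python}
> # Estimate how many csv rows fit in ~2GB by sampling the first
> # 100,000 rows (hypothetical helper, not from the original report).
> import pandas as pd
>
> sample = pd.read_csv('gaz2016zcta5distancemiles.csv', nrows=100_000)
> bytes_per_row = len(sample.to_csv(index=False)) / len(sample)
> print(int(2 * 2**30 / bytes_per_row))  # rows per ~2GiB chunk
> {code}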
> Then the basic idea from the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] is to instantiate the writer class, loop over chunks of the csv writing each to parquet, and then close the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> writer.close()
> {code}
> Running this python script produces the file `gaz2016zcta5distancemiles.parquet`, but just attempting to read the metadata with `pq.ParquetFile()` produces the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume pandas would complain while reading the csv if the columns in the data were not `string`, `string`, and `float64`, so I think creating the Parquet schema that way should be fine.
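> (A narrower probe, assuming pyarrow's pq.read_metadata helper, would be to read only the Parquet footer, which may show whether the file metadata itself is readable; a hypothetical check, not from the original report:)
> {code:python}
> # Hypothetical check: read only the footer metadata of the file.
> import pyarrow.parquet as pq
>
> meta = pq.read_metadata('gaz2016zcta5distancemiles.parquet')
> print(meta.num_rows, meta.num_row_groups)
> {code}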


