Posted to dev@arrow.apache.org by "Bulat Yaminov (Jira)" <ji...@apache.org> on 2020/03/01 19:15:00 UTC

[jira] [Created] (ARROW-7972) Allow reading CSV in chunks

Bulat Yaminov created ARROW-7972:
------------------------------------

             Summary: Allow reading CSV in chunks
                 Key: ARROW-7972
                 URL: https://issues.apache.org/jira/browse/ARROW-7972
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Bulat Yaminov


Currently in the Python API you can read a CSV using [{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html]. There are some settings for the reader that you can pass via [{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions], but I don't see an option to read only a part of the CSV file instead of the whole thing (or to start reading from {{skip_rows}}). As a result, if I have a big CSV file that cannot fit into memory, I cannot process it with this API.

Is it possible to implement a chunked iterator, in a similar way to how [Pandas allows it|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]:
{code:python}
from pyarrow import csv
for table_chunk in csv.read_csv("big.csv", read_options=csv.ReadOptions(chunksize=1_000_000)):
    # do something with the table_chunk, e.g. filter and save to disk
    pass
{code}
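In the meantime, one possible workaround is to do the chunked reading with Pandas and convert each chunk to an Arrow Table. This is only a sketch, not a proposed implementation; the helper name {{iter_csv_chunks}} is hypothetical, and it relies on Pandas' existing {{chunksize}} support rather than anything in pyarrow itself:

{code:python}
import pandas as pd
import pyarrow as pa

def iter_csv_chunks(path, chunksize=1_000_000):
    """Yield pyarrow.Table chunks of a CSV by reading it with pandas.

    Workaround sketch: pandas.read_csv(..., chunksize=N) returns an
    iterator of DataFrames, each of which is converted to an Arrow Table.
    """
    for df_chunk in pd.read_csv(path, chunksize=chunksize):
        yield pa.Table.from_pandas(df_chunk)
{code}

This keeps peak memory bounded by the chunk size, at the cost of a round trip through Pandas (so type inference is Pandas', not Arrow's).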

Thanks in advance for your feedback.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)