Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/10/30 12:24:00 UTC

[jira] [Commented] (ARROW-10419) [C++] Add max_rows parameter to csv ReadOptions

    [ https://issues.apache.org/jira/browse/ARROW-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223614#comment-17223614 ] 

Joris Van den Bossche commented on ARROW-10419:
-----------------------------------------------

Agreed that such a keyword would be useful (just being able to peek at the first rows of a big file would already be very handy).

I think the general problem with this is that the reader processes different blocks of data in parallel (there is a {{block_size}} option in {{ReadOptions}}), so this might not work with the default multithreaded reading (you would need to know the number of rows in the first block before deciding whether the next block needs to be processed as well).

For the "chunked" reader case you mention, there is already {{pyarrow.csv.open_csv}} which returns a streaming reader, which reads batch by batch:

{code}
In [1]: pd.DataFrame({'a': np.arange(1_000_000)}).to_csv("test.csv", index=False)

In [2]: from pyarrow import csv

In [3]: reader = csv.open_csv("test.csv")

In [4]: reader
Out[4]: <pyarrow._csv.CSVStreamingReader at 0x7fe629398278>

In [5]: reader.read_next_batch().to_pandas()
Out[5]: 
             a
0            0
1            1
2            2
3            3
4            4
...        ...
165664  165664
165665  165665
165666  165666
165667  165667
165668  165668

[165669 rows x 1 columns]

In [6]: reader.read_next_batch().to_pandas()
Out[6]: 
             a
0       165669
1       165670
2       165671
3       165672
4       165673
...        ...
149791  315460
149792  315461
149793  315462
149794  315463
149795  315464

[149796 rows x 1 columns]
{code}

The number of rows per batch depends on the {{block_size}}, so you _can_ control this, but not as conveniently as just specifying a maximum number of rows ({{max_rows}}).

{code}
In [13]: reader = csv.open_csv("test.csv", read_options=csv.ReadOptions(block_size=20))

In [14]: reader.read_next_batch().to_pandas()
Out[14]: 
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8

In [15]: reader.read_next_batch().to_pandas()
Out[15]: 
    a
0   9
1  10
2  11
3  12
4  13
5  14
6  15
{code}

So using this streaming reader with {{open_csv}}, you can actually already more or less achieve what you want, I think? (It could also be used to read only the first N rows, instead of using {{read_csv}}.)
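For example, something along these lines (an untested sketch; the {{head_csv}} helper name is made up) would give you only the first N rows via the streaming reader:

{code}
import pyarrow as pa
from pyarrow import csv

def head_csv(path, n_rows):
    # Hypothetical helper: read batches until we have at least n_rows,
    # then trim the overshoot from the last batch.
    reader = csv.open_csv(path)
    batches = []
    rows_read = 0
    while rows_read < n_rows:
        try:
            batch = reader.read_next_batch()
        except StopIteration:
            # end of file reached before n_rows
            break
        batches.append(batch)
        rows_read += batch.num_rows
    table = pa.Table.from_batches(batches, schema=reader.schema)
    return table.slice(0, n_rows)

first_1000 = head_csv("test.csv", 1_000)
{code}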

The question is still whether we can make this easier directly in pyarrow, i.e. being able to specify a {{max_rows}} instead of a {{block_size}}. For the general multithreaded reader I don't think this is easy (as mentioned above), but for the streaming reader (which is single-threaded anyway) it should be possible, I suppose.
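The same pattern could also cover the {{skip_rows}} + {{max_rows}} chunked use case from the issue description, e.g. (again just a sketch, {{read_row_slice}} is a made-up helper):

{code}
import pyarrow as pa
from pyarrow import csv

def read_row_slice(path, start, num_rows):
    # Hypothetical helper: stream batch by batch (single-threaded) and keep
    # only the rows falling in [start, start + num_rows).
    reader = csv.open_csv(path)
    batches = []
    seen = 0
    while seen < start + num_rows:
        try:
            batch = reader.read_next_batch()
        except StopIteration:
            break
        lo = max(start - seen, 0)
        hi = min(start + num_rows - seen, batch.num_rows)
        if hi > lo:
            batches.append(batch.slice(lo, hi - lo))
        seen += batch.num_rows
    return pa.Table.from_batches(batches, schema=reader.schema)

# rows 1,000 up to (but not including) 2,000
chunk = read_row_slice("test.csv", 1_000, 1_000)
{code}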

cc [~apitrou]

> [C++] Add max_rows parameter to csv ReadOptions
> -----------------------------------------------
>
>                 Key: ARROW-10419
>                 URL: https://issues.apache.org/jira/browse/ARROW-10419
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Marc Garcia
>            Priority: Major
>              Labels: csv
>
> I'm trying to read only the first 1,000 rows of a huge CSV with PyArrow.
> I don't see a way to do this with Arrow. I guess it should be easy to implement by adding a `max_rows` parameter to pyarrow.csv.ReadOptions.
> After reading the first 1,000, it should be possible to load the next 1,000 (or any other chunk) by using the new `max_rows` together with `skip_rows` (e.g. `pyarrow.csv.read_csv(path, pyarrow.csv.ReadOptions(skip_rows=1_000, max_rows=1_000))` would read rows 1,000 to 2,000).
> Thanks!
