Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/10/30 12:24:00 UTC
[jira] [Comment Edited] (ARROW-10419) [C++] Add max_rows parameter to csv ReadOptions
[ https://issues.apache.org/jira/browse/ARROW-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223614#comment-17223614 ]
Joris Van den Bossche edited comment on ARROW-10419 at 10/30/20, 12:23 PM:
---------------------------------------------------------------------------
Agreed that such a keyword would be useful (just being able to peek at the first rows of a big file would already make it important).
I think in general the problem with this is that the reader processes different blocks of data in parallel (there is a {{block_size}} option in {{ReadOptions}}), so this might not work with the default of multithreaded reading (you need to know the number of rows of the first block to know if you need to process the next block as well).
For the "chunked" reader case you mention, there is already {{pyarrow.csv.open_csv}}, which returns a streaming reader that reads batch by batch:
{code}
In [1]: pd.DataFrame({'a': np.arange(1_000_000)}).to_csv("test.csv", index=False)
In [2]: from pyarrow import csv
In [3]: reader = csv.open_csv("test.csv")
In [4]: reader
Out[4]: <pyarrow._csv.CSVStreamingReader at 0x7fe629398278>
In [5]: reader.read_next_batch().to_pandas()
Out[5]:
a
0 0
1 1
2 2
3 3
4 4
... ...
165664 165664
165665 165665
165666 165666
165667 165667
165668 165668
[165669 rows x 1 columns]
In [5]: reader.read_next_batch().to_pandas()
Out[5]:
a
0 165669
1 165670
2 165671
3 165672
4 165673
... ...
149791 315460
149792 315461
149793 315462
149794 315463
149795 315464
[149796 rows x 1 columns]
{code}
The number of rows per batch depends on the {{block_size}}, so you _can_ control this, but not as easily as just specifying {{max_rows}} directly.
{code}
In [13]: reader = csv.open_csv("test.csv", read_options=csv.ReadOptions(block_size=20))
In [14]: reader.read_next_batch().to_pandas()
Out[14]:
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
In [15]: reader.read_next_batch().to_pandas()
Out[15]:
a
0 9
1 10
2 11
3 12
4 13
5 14
6 15
{code}
So using this streaming reader with {{open_csv}}, you can actually already somewhat achieve what you want, I think? (It could also be used to read only the first N rows, instead of using {{read_csv}}.)
The question remains whether we can make this easier directly in pyarrow, by allowing a {{max_rows}} to be specified instead of a {{block_size}}. For the general multithreaded reader, I don't think this is easy (as mentioned above), but for the streaming reader (which is single-threaded anyway) it should be possible, I suppose.
cc [~apitrou]
> [C++] Add max_rows parameter to csv ReadOptions
> -----------------------------------------------
>
> Key: ARROW-10419
> URL: https://issues.apache.org/jira/browse/ARROW-10419
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Marc Garcia
> Priority: Major
> Labels: csv
>
> I'm trying to read only the first 1,000 rows of a huge CSV with PyArrow.
> I don't see a way to do this with Arrow. I guess it should be easy to implement by adding a `max_rows` parameter to pyarrow.csv.ReadOptions.
> After reading the first 1,000, it should be possible to load the next 1,000 (or any other chunk) by using the new `max_rows` together with `skip_rows` (e.g. `pyarrow.csv.read_csv(path, pyarrow.csv.ReadOptions(skip_rows=1_000, max_rows=1_000))` would read rows 1,000 to 2,000).
> Thanks!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)