Posted to issues@arrow.apache.org by "Ziheng Wang (Jira)" <ji...@apache.org> on 2022/08/25 17:48:00 UTC
[jira] [Created] (ARROW-17529) Clean up how the CSV reader handles the first buffer
Ziheng Wang created ARROW-17529:
-----------------------------------
Summary: Clean up how the CSV reader handles the first buffer
Key: ARROW-17529
URL: https://issues.apache.org/jira/browse/ARROW-17529
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Ziheng Wang
Assignee: Ziheng Wang
The way the CSV reader currently handles the first block of a CSV file could be improved.
As far as I can tell, the first block is read more than once: first by the Peek in file_csv.cc, and then again by InitFromBlock in OpenReaderAsync in reader.cc.
This can be a problem when the first block is large, and it also delays the synchronous opening of a dataset.
A possible solution is to use a smaller block size for the peek in file_csv.cc, since GetConvertOptions does not need the entire block. We could add another option to reader_options for this, called first_peek_size or something similar.
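To make the cost concrete, here is a minimal standalone sketch in plain Python. It does not use or reproduce the Arrow code; CountingReader and open_csv_sketch are hypothetical names, and first_peek_size is only the option name suggested above, not an existing setting. It assumes the peek only needs the header line, and it counts how many bytes are read when the peek uses the full block size versus a smaller size.

```python
import io


class CountingReader:
    """Wraps an in-memory byte stream and counts the bytes read from it."""

    def __init__(self, data: bytes):
        self._stream = io.BytesIO(data)
        self.bytes_read = 0

    def read_at(self, offset: int, size: int) -> bytes:
        self._stream.seek(offset)
        chunk = self._stream.read(size)
        self.bytes_read += len(chunk)
        return chunk


def open_csv_sketch(data: bytes, block_size: int, first_peek_size=None):
    """Hypothetical model of 'peek, then open'; not the Arrow implementation.

    Returns the column names found by the peek and the total bytes read.
    """
    reader = CountingReader(data)
    # Without a dedicated peek size, the peek reads a whole block.
    peek_size = block_size if first_peek_size is None else first_peek_size

    # Step 1: peek at the start of the file to find the header line.
    peeked = reader.read_at(0, peek_size)
    header, newline, _ = peeked.partition(b"\n")
    if not newline and len(peeked) < len(data):
        raise ValueError("peek is too small to contain the header line")
    column_names = header.rstrip(b"\r").decode().split(",")

    # Step 2: the reader starts over and reads the first block again.
    reader.read_at(0, block_size)
    return column_names, reader.bytes_read


data = b"a,b,c\n" + b"1,2,3\n" * 1000
names_full, bytes_full = open_csv_sketch(data, block_size=4096)
names_small, bytes_small = open_csv_sketch(data, block_size=4096, first_peek_size=64)
print(names_full, bytes_full, bytes_small)
```

In this model the full-size peek reads the first block twice, while the small peek adds only first_peek_size bytes on top of the one block read. The sketch also shows the trade-off: a peek that is too small to cover the header line has to fail or fall back to a larger read.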
--
This message was sent by Atlassian Jira
(v8.20.10#820010)