Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/09/29 04:02:00 UTC

[jira] [Created] (ARROW-14164) [C++][Dataset] Enhance dataset writer to allow file-per-batch

Weston Pace created ARROW-14164:
-----------------------------------

             Summary: [C++][Dataset] Enhance dataset writer to allow file-per-batch
                 Key: ARROW-14164
                 URL: https://issues.apache.org/jira/browse/ARROW-14164
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


The dataset writer currently groups incoming batches into large files whose size is controlled by max_rows_per_file.  In the PR for this work, [~jorisvandenbossche] recommended adding an option where each incoming batch creates a new file.

This would give the user fine-grained control over how many files are created (provided they are doing a very basic scan/filter/project and not using any more sophisticated nodes, which may resize batches).
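
To make the difference between the two policies concrete, here is a rough sketch in plain Python. It is only a simplified model and not the Arrow implementation: the function names plan_files_by_max_rows and plan_file_per_batch are hypothetical, and the assumption that a batch is split across files once max_rows_per_file is reached is mine, for illustration.

```python
from typing import Iterable, List


def plan_files_by_max_rows(batch_sizes: Iterable[int], max_rows_per_file: int) -> List[int]:
    """Model of the current policy: pack rows into files of at most
    max_rows_per_file rows. Returns the row count of each output file.

    Assumption (not taken from the issue): a batch that does not fit in the
    current file is split, and the remainder starts the next file.
    """
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    files: List[int] = []
    current = 0  # rows already placed in the file being filled
    for rows in batch_sizes:
        while rows > 0:
            take = min(rows, max_rows_per_file - current)
            current += take
            rows -= take
            if current == max_rows_per_file:
                files.append(current)  # file is full, close it
                current = 0
    if current > 0:
        files.append(current)  # final, partially filled file
    return files


def plan_file_per_batch(batch_sizes: Iterable[int]) -> List[int]:
    """Model of the proposed option: every non-empty incoming batch
    becomes its own file, so file count follows batch count."""
    return [rows for rows in batch_sizes if rows > 0]


if __name__ == "__main__":
    batches = [400, 400, 400]
    print(plan_files_by_max_rows(batches, max_rows_per_file=1000))
    print(plan_file_per_batch(batches))
```

Under this model the number of files in the file-per-batch case depends only on how many batches reach the writer, which is why any upstream node that resizes batches would change the result.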



--
This message was sent by Atlassian Jira
(v8.3.4#803005)