You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/05 18:30:00 UTC

[jira] [Commented] (ARROW-16118) [C++] Reduce memory usage when writing to IPC

    [ https://issues.apache.org/jira/browse/ARROW-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517637#comment-17517637 ] 

Weston Pace commented on ARROW-16118:
-------------------------------------

One note is that this will only be possible when using a local filesystem.  Object stores usually do not support modification of files after they have been uploaded.  So we might need to add a flag to a filesystem to ask whether it supports this but that can be useful.  On spinning disk filesystems (i.e. HDD) the seek penalty may cause this to be less (or potentially negative) of a benefit if there is not much RAM pressure.

However, I could see this being useful in cases where we are writing to a very fast solid state drive.

Another memory-usage issue that might be related since we are talking about writes is ARROW-14635.  I'd like to move to direct I/O (or at least being smarter about OS page cache) when writing large datasets since we otherwise end up causing the system to swap for no good reason.

> [C++] Reduce memory usage when writing to IPC
> ---------------------------------------------
>
>                 Key: ARROW-16118
>                 URL: https://issues.apache.org/jira/browse/ARROW-16118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jorge Leitão
>            Priority: Major
>
> Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) where N is the average size of the buffer and B the number of buffers in the recordbatch.
> This is because we need the buffer location and total number of bytes to write the header of the record, which is only known after e.g. knowning by how much the buffers were compressed.
> When the writer supports seeking, this memory usage can be reduced to O(N) where N is the average size of a primitive buffer over all fields. This is done using the following pseudo-code implementation:
> {code:java}
> start = writer.seek(current);
> empty_locations = create_empty_header(schema)
> write_header(writer, empty_locations)
> locations = write_buffers(writer, batch)
> writer.seek(start)
> write_header(writer, locations)
> {code}
> This has a significantly lower memory footprint. O(N) vs O(N*B)
> It could be interesting for the C++ implementation to support this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)