Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2022/04/05 08:45:00 UTC

[jira] [Updated] (ARROW-16118) [C++] Reduce memory usage when writing to IPC

     [ https://issues.apache.org/jira/browse/ARROW-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-16118:
---------------------------------
    Description: 
Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) memory, where N is the average size of a buffer and B is the number of buffers.

This is because writing the header of the record requires each buffer's location and the total number of bytes, which are only known once the buffers have been processed, e.g. after knowing by how much each buffer was compressed.
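
To make the constraint concrete, here is a minimal Python sketch of the buffered approach. It assumes a toy layout (a count followed by fixed-width offset/length pairs, then zlib-compressed buffers), which is not the actual Arrow IPC format; `write_batch_buffered` is a hypothetical name used only for illustration.

```python
import io
import struct
import zlib


def write_batch_buffered(sink, buffers):
    """Toy [header][buffers] writer that compresses everything up front.

    The header needs each buffer's offset and compressed length, so every
    compressed buffer stays in memory until the header has been written.
    """
    compressed = [zlib.compress(b) for b in buffers]  # all B buffers alive at once
    header = struct.pack("<I", len(compressed))
    offset = 0  # offsets are relative to the start of the body
    for c in compressed:
        header += struct.pack("<QQ", offset, len(c))
        offset += len(c)
    sink.write(header)
    for c in compressed:
        sink.write(c)
    return [len(c) for c in compressed]


sink = io.BytesIO()
lengths = write_batch_buffered(sink, [b"a" * 1000, b"b" * 10, bytes(range(256))])
```

In this sketch the list `compressed` is what drives the O(N*B) footprint: nothing can be written until every buffer's final size is known.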

When the writer supports seeking, this memory usage can be reduced to O(N) where N is the average size of a primitive buffer over all fields. This is done using the following pseudo-code implementation:


{code:java}
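start = writer.seek(current)  // position where the header begins
empty_locations = create_empty_header(schema)
write_header(writer, empty_locations)  // placeholder with the same size as the final header
locations = write_buffers(writer, batch)  // one buffer in memory at a time
end = writer.seek(current)
writer.seek(start)
write_header(writer, locations)  // overwrite the placeholder with the real locations
writer.seek(end)  // restore the position for the next message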
{code}

This has a significantly lower memory footprint: O(N) instead of O(N*B).
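
The pseudo-code above can be sketched in Python against any seekable sink. This is an illustration under assumptions, not the Arrow implementation: the layout (a count plus fixed-width offset/length entries, followed by zlib-compressed buffers) is a toy stand-in for the IPC format, and `write_header`, `write_batch_seekable` and `read_buffer` are hypothetical helpers. The approach relies on the placeholder header having exactly the same byte size as the final one.

```python
import io
import struct
import zlib

ENTRY = struct.Struct("<QQ")  # (offset, length) per buffer; toy layout, not Arrow IPC


def write_header(sink, locations):
    sink.write(struct.pack("<I", len(locations)))
    for offset, length in locations:
        sink.write(ENTRY.pack(offset, length))


def write_batch_seekable(sink, buffers):
    """Write [header][buffers], holding one compressed buffer at a time."""
    start = sink.tell()
    # Placeholder header: entries are fixed-width, so its size is already final.
    write_header(sink, [(0, 0)] * len(buffers))
    body_start = sink.tell()
    locations = []
    for raw in buffers:
        compressed = zlib.compress(raw)  # only this compressed buffer is alive
        locations.append((sink.tell() - body_start, len(compressed)))
        sink.write(compressed)
    end = sink.tell()
    sink.seek(start)
    write_header(sink, locations)  # overwrite the placeholder in place
    sink.seek(end)  # leave the sink positioned for the next batch
    return locations


def read_buffer(data, start, index):
    """Decode one buffer of the batch whose header begins at `start`."""
    (count,) = struct.unpack_from("<I", data, start)
    offset, length = ENTRY.unpack_from(data, start + 4 + index * ENTRY.size)
    body_start = start + 4 + count * ENTRY.size
    return zlib.decompress(data[body_start + offset: body_start + offset + length])


sink = io.BytesIO()
first = [b"a" * 1000, b"b" * 10]
second = [bytes(range(256))]
write_batch_seekable(sink, first)
second_start = sink.tell()
write_batch_seekable(sink, second)
```

Seeking back to `end` after rewriting the header matters when several batches are written to the same sink; without it the next batch would overwrite the buffers just written.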

It could be interesting for the C++ implementation to support this.

  was:
Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) where N is the average size of the buffer and B the number of buffers.

This is because we need the buffer location and total number of bytes to write the header of the record, which is only known after e.g. compressing them.

When the writer supports seeking, this memory usage can be reduced to O(N) where N is the average size of a primitive buffer over all fields. This is done using the following pseudo-code implementation:


{code:java}
start = writer.seek(current);
empty_locations = create_empty_header(schema)
write_header(writer, empty_locations)
locations = write_buffers(writer, batch)
writer.seek(start)
write_header(writer, locations)
{code}

This has a significantly lower memory footprint. O(N) vs O(N*B)

It could be interesting for the C++ implementation to support this.


> [C++] Reduce memory usage when writing to IPC
> ---------------------------------------------
>
>                 Key: ARROW-16118
>                 URL: https://issues.apache.org/jira/browse/ARROW-16118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jorge Leitão
>            Priority: Major


