Posted to issues@arrow.apache.org by "Chris Hutchinson (JIRA)" <ji...@apache.org> on 2019/02/10 23:25:00 UTC

[jira] [Commented] (ARROW-4503) [C#] ArrowStreamReader allocates and copies data excessively

    [ https://issues.apache.org/jira/browse/ARROW-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764599#comment-16764599 ] 

Chris Hutchinson commented on ARROW-4503:
-----------------------------------------

The serialization code needs a lot of work. I recently made changes to ArrowBuffer.Builder that I now believe were a mistake. I believe it should probably allocate from MemoryPool and "freeze" the builder after building an ArrowBuffer. I think this means you would expect an InvalidOperationException when attempting to modify the builder after building an ArrowBuffer, or perhaps the builder would just allocate new backing memory when needed.
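
To make the "freeze" idea concrete, here is a minimal sketch. FreezingBufferBuilder is a hypothetical stand-in, not the real ArrowBuffer.Builder, and it rents from System.Buffers.MemoryPool<byte>.Shared as an assumed substitute for Arrow's own MemoryPool. It only shows the InvalidOperationException variant, not the "allocate new backing memory" alternative.

```csharp
using System;
using System.Buffers;

// Hypothetical stand-in for ArrowBuffer.Builder; names are illustrative only.
public sealed class FreezingBufferBuilder
{
    private IMemoryOwner<byte> _owner;
    private int _length;
    private bool _frozen;

    public FreezingBufferBuilder(int initialCapacity = 64)
    {
        // Backing memory comes from the pool from the start.
        _owner = MemoryPool<byte>.Shared.Rent(initialCapacity);
    }

    public FreezingBufferBuilder Append(ReadOnlySpan<byte> data)
    {
        ThrowIfFrozen();
        EnsureCapacity(_length + data.Length);
        data.CopyTo(_owner.Memory.Span.Slice(_length));
        _length += data.Length;
        return this;
    }

    // Hands the pooled memory to the caller without copying, then freezes.
    // The caller is responsible for disposing the returned owner.
    public IMemoryOwner<byte> Build(out int length)
    {
        ThrowIfFrozen();
        _frozen = true;
        length = _length;
        return _owner;
    }

    private void ThrowIfFrozen()
    {
        if (_frozen)
            throw new InvalidOperationException("Builder is frozen after Build().");
    }

    private void EnsureCapacity(int required)
    {
        if (required <= _owner.Memory.Length) return;

        // Grow by renting a larger block and moving the bytes written so far.
        IMemoryOwner<byte> bigger =
            MemoryPool<byte>.Shared.Rent(Math.Max(required, _owner.Memory.Length * 2));
        _owner.Memory.Span.Slice(0, _length).CopyTo(bigger.Memory.Span);
        _owner.Dispose();
        _owner = bigger;
    }
}
```

With this shape, Build() performs no copy at all, because the memory the builder was already writing into is the memory the resulting buffer would wrap.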

As for ByteBuffer, consider that it is presently code generated by the Google FlatBuffers schema compiler. The correct approach seems to be to allocate from MemoryPool, read from the stream into the allocated memory, have the ByteBuffer backed by that memory, and construct an ArrowBuffer directly from the ByteBuffer instead of using ArrowBuffer.Builder. Alternatively, ArrowBuffer.Builder could be modified to accept existing memory as its initial backing store.
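
A rough sketch of the first two steps (allocate once from a pool, then read the stream straight into that memory) might look like the following. SingleAllocationReader and ReadExactly are hypothetical names, and System.Buffers.MemoryPool<byte>.Shared is again an assumed stand-in for Arrow's MemoryPool. Wrapping the result in a ByteBuffer or ArrowBuffer is left out, since that depends on constructors I have not verified here.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper; not part of Apache.Arrow.
public static class SingleAllocationReader
{
    // Rents one block from the pool and fills it directly from the stream.
    // 'data' is the filled slice; the caller disposes the returned owner.
    public static IMemoryOwner<byte> ReadExactly(Stream stream, int length, out Memory<byte> data)
    {
        IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(length);
        Memory<byte> memory = owner.Memory.Slice(0, length);

        // Stream.Read(byte[], int, int) is what .NET Standard 2.0 offers, so
        // borrow the array behind the pooled memory instead of a temp byte[].
        if (!MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> segment))
        {
            owner.Dispose();
            throw new InvalidOperationException("Pooled memory is not array-backed.");
        }

        int total = 0;
        while (total < length)
        {
            int read = stream.Read(segment.Array, segment.Offset + total, length - total);
            if (read == 0)
            {
                owner.Dispose();
                throw new EndOfStreamException();
            }
            total += read;
        }

        data = memory;
        return owner;
    }
}
```

The only allocation is the Rent call. Everything downstream would slice or wrap `data` rather than copy it.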

Please note one of the original goals was to target .NET Standard 1.3. I would be comfortable moving that up to .NET Standard 2.0, but also would be interested in hearing any arguments on that.

Good work!

> [C#] ArrowStreamReader allocates and copies data excessively
> ------------------------------------------------------------
>
>                 Key: ARROW-4503
>                 URL: https://issues.apache.org/jira/browse/ARROW-4503
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C#
>            Reporter: Eric Erhardt
>            Priority: Major
>              Labels: performance
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When reading `RecordBatch` instances using the `ArrowStreamReader` class, it is currently allocating and copying memory 3 times for the data.
>  # It is allocating memory in order to [read the data from the Stream|https://github.com/apache/arrow/blob/044b418fa108a57f0b4e2e887546cc3e68271397/csharp/src/Apache.Arrow/Ipc/ArrowStreamReader.cs#L72-L74], and then reading from the Stream.  (This should be the only allocation that is necessary.)
>  # It then [creates a new `ArrowBuffer.Builder`|https://github.com/apache/arrow/blob/044b418fa108a57f0b4e2e887546cc3e68271397/csharp/src/Apache.Arrow/Ipc/ArrowStreamReader.cs#L227-L228], which allocates another `byte[]`, and calls `Append` on it, which copies the values to the new `byte[]`.
>  # Finally, it then calls `.Build()` on the `ArrowBuffer.Builder`, which [allocates memory from the MemoryPool, and then copies the intermediate buffer|https://github.com/apache/arrow/blob/044b418fa108a57f0b4e2e887546cc3e68271397/csharp/src/Apache.Arrow/ArrowBuffer.Builder.cs#L112-L121] into it.
>  
> We should reduce this overhead to only allocating a single time (from the MemoryPool), and not copying the data more times than necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)