Posted to dev@arrow.apache.org by "Anthony Abate (Jira)" <ji...@apache.org> on 2020/01/07 20:15:00 UTC

[jira] [Created] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs

Anthony Abate created ARROW-7511:
------------------------------------

             Summary: [C#] - Batch / Data Size Can't Exceed 2 gigs
                 Key: ARROW-7511
                 URL: https://issues.apache.org/jira/browse/ARROW-7511
             Project: Apache Arrow
          Issue Type: Bug
          Components: C#
    Affects Versions: 0.15.1
            Reporter: Anthony Abate


While the Arrow spec does not forbid batches larger than 2 GB, the C# library cannot support them in its current form: it tries to put the whole batch into a single Span<byte>/Memory<byte>, and managed memory limits those to 2 GB.

It is possible to fix this by not using Memory/Span/byte[] for the entire batch and instead moving the memory mapping down to the ArrowBuffers. This only moves the problem 'lower', as it would still limit the data of a single column in a single batch to 2 GB.
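
To illustrate the difference between the two limits, here is a rough sketch in Python (used only for the arithmetic, not as C# library code). The buffer lengths are made-up numbers, the helper names are hypothetical, and padding and metadata are ignored.

```python
# Span<byte>.Length in C# is a 32-bit signed int, so this is the hard ceiling.
INT32_MAX = 2**31 - 1

def fits_as_single_span(buffer_lengths):
    """True if the whole batch body could live in one Span<byte>."""
    return sum(buffer_lengths) <= INT32_MAX

def fits_per_buffer(buffer_lengths):
    """True if every individual buffer could live in its own Span<byte>."""
    return all(n <= INT32_MAX for n in buffer_lengths)

# Hypothetical batch: three buffers, each under 2 GB, about 3.4 GB in total.
batch = [1_500_000_000, 1_500_000_000, 400_000_000]

print(fits_as_single_span(batch))  # False: the batch as a whole is too large
print(fits_per_buffer(batch))      # True: each buffer fits on its own
```

Mapping memory per buffer accepts this batch, but a single buffer above the ceiling would still fail.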

This seems like plenty of memory, but consider string columns: the data is just one giant string of values appended together, plus offsets, and it can get very large quickly.
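
As a back-of-the-envelope sketch (Python, illustrative only): assuming 32-bit offsets and a made-up row count and average value size, the data buffer of a string column crosses 2 GB well before the row count looks unusual. The function name is hypothetical.

```python
# Span<byte>.Length in C# is a 32-bit signed int.
INT32_MAX = 2**31 - 1

def string_column_buffer_sizes(row_count, avg_bytes_per_value):
    """Estimate (offsets_bytes, data_bytes) for a string column.

    Assumes one 32-bit offset per row plus a final end offset,
    and ignores the validity bitmap and any padding.
    """
    offsets_bytes = (row_count + 1) * 4
    data_bytes = row_count * avg_bytes_per_value
    return offsets_bytes, data_bytes

# Assumed workload: 50 million rows averaging 50 bytes of UTF-8 each.
offsets_bytes, data_bytes = string_column_buffer_sizes(50_000_000, 50)
print(offsets_bytes)           # 200000004
print(data_bytes)              # 2500000000
print(data_bytes > INT32_MAX)  # True
```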

I think the unfortunate problem is that memory management in the C# managed world is always going to hit the 2 GB limit somewhere. (Please correct me if I am wrong on this statement.)

That ultimately means either the C# library has to reject files with certain characteristics (i.e. validation checks on opening), or the spec needs to put upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to eliminate the need for more than 2 GB of contiguous memory for the smallest Arrow object.
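
A validation check on opening could look roughly like the following sketch (Python, for illustration only; this is not the C# library's API). It assumes the reader already has the buffer lengths from the batch metadata, and the function names are hypothetical.

```python
# Largest buffer a managed Span<byte> could describe (32-bit signed length).
MAX_MANAGED_BYTES = 2**31 - 1

def find_oversized_buffers(buffer_lengths, limit=MAX_MANAGED_BYTES):
    """Return (index, length) for every buffer longer than the limit."""
    return [(i, n) for i, n in enumerate(buffer_lengths) if n > limit]

def validate_batch(buffer_lengths, limit=MAX_MANAGED_BYTES):
    """Raise ValueError up front instead of failing later during the read."""
    oversized = find_oversized_buffers(buffer_lengths, limit)
    if oversized:
        details = ", ".join(f"buffer {i}: {n} bytes" for i, n in oversized)
        raise ValueError(f"batch has buffers over {limit} bytes ({details})")

# Hypothetical buffer lengths as they might be read from batch metadata.
validate_batch([1_000, 2**31 - 1])  # passes silently
try:
    validate_batch([1_000, 2**31])
except ValueError as err:
    print(err)
```

Rejecting the file this way gives a clear error at open time rather than an overflow partway through a read.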

However, if the spec was indeed designed for the smallest buffer object to be larger than 2 GB, or for the entire Arrow memory buffer to be contiguous, one has to wonder whether at some point it might just make sense for the C# library to use the C++ library as its memory manager, as replicating very large blocks of memory is more work than it's worth.

In any case, this issue is more about 'deferring' the 2 GB size problem by moving it down to the buffer objects. This might require some rewrite of the batch data structures.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)