You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Bob Tinsman (Jira)" <ji...@apache.org> on 2021/03/26 18:58:00 UTC
[jira] [Commented] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf

    [ https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309660#comment-17309660 ] 

Bob Tinsman commented on ARROW-7494:
------------------------------------

I was looking for Java issues and this one showed up. I can see that there was a lot of work done on it by [~tianchen92] but the activity has petered out, so I can take it if no one objects.

I've reviewed the previous work (including the related ARROW-7539) and decided to start from scratch. I've been experimenting locally but will put my work into a PR soon so I can get people to review it.

I agree with [~jnadeau] that there is leftover stuff from an old implementation--ArrowBuf is overloaded. It's a tricky thing to factor out the r/w indices, but I think it can be done by looking at the usage patterns. Here's a rough summary (feel free to clarify or correct any mistakes):
 * The main pattern is to allocate ArrowBufs and slice them up into the buffers used by various vector implementations. This doesn't really require the indices.
 * Some code (the JsonFileReader for example) uses the various byte-twiddling methods (like writeByte(), writeShort(), etc.) that make use of the indices.
 * Some core buffer management code like BufferLedger sets the indices; I assume this is to reset them to known values before handing out ArrowBufs.

Possible approaches:
 * The original approach: rip out the indices and methods that use them, then change all the code using those methods to use equivalent logic
 ** It's a lot of changes, and difficult to get them all right.
 ** It could be a tough merge to other Arrow forks or other projects
 * Minimal approach that saves memory: put the r/w indices in a helper object that is allocated as needed; otherwise, no change to ArrowBuf methods
 ** Pro: save 12 bytes in ArrowBuf (4 byte object ref vs. 2 long indices)
 ** Pro: pretty safe, incremental refactoring that you could build on
 ** Con: haven't really cleaned up ArrowBuf
 * Factor out superclass (BaseArrowBuf) without indices, replace use of ArrowBuf with BaseArrowBuf where possible (I'm trying this one now)
 ** Pro: ArrowBuf doesn't change
 ** Con: The replacement of ArrowBuf touches a lot of code
 ** Con: Unclear how to change some of the core buffer management code like BufferLedger, which sets indices
 * Push indices and related methods into subclass (IndexedArrowBuf)
 ** Pro: clean up ArrowBuf, the original goal
 ** Pro: code that still needs old methods can work on a subclass instance
 ** Pro: core buffer logic still uses ArrowBuf
 ** Con: have to make sure semantics of setting the indices in core buffer logic are still the same

> [Java] Remove reader index and writer index from ArrowBuf
> ---------------------------------------------------------
>
>                 Key: ARROW-7494
>                 URL: https://issues.apache.org/jira/browse/ARROW-7494
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Java
>            Reporter: Jacques Nadeau
>            Assignee: Ji Liu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Reader and writer index and functionality doesn't belong on a chunk of memory and is due to inheritance from ByteBuf. As part of removing ByteBuf inheritance, we should also remove reader and writer indexes from ArrowBuf functionality. It wastes heap memory for rare utility. In general, a slice can be used instead of a reader/writer index pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)