Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/22 18:45:02 UTC

[GitHub] [arrow-rs] nevi-me commented on pull request #335: respect offset in utf8 and list casts

nevi-me commented on pull request #335:
URL: https://github.com/apache/arrow-rs/pull/335#issuecomment-846448353


   > I am a bit uncertain about the spec here: do we require the offset buffer to start at 0, or is the requirement that the last value minus the first value must be equal to the length of the values buffer? @nevi-me , do you know what is the spec here?
   
   This might be a bit tricky. AFAIK the spec doesn't prescribe what happens when a compute kernel interacts with sliced data.
   
   > Generally the first value in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0. [0]
   
   I would interpret "generally" as 'most implementations will expect the offsets to start at 0'. In that case, I would prefer a solution that carries the offset of the array and starts the offset buffer at 0.
   
   I think carrying the offset of the input into the output is the most performant and compatible solution in this case. Otherwise we'd have to recalculate the offsets to make sure that they start from 0.
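   
   To make the two options concrete, here is a minimal sketch in plain Rust. It does not use the arrow-rs API: `Utf8View`, `normalize_offsets` and `demo` are hypothetical names for illustration only, and the buffers are ordinary slices rather than Arrow buffers.
   
   ```rust
   use std::ops::Range;

   /// Hypothetical view (not an arrow-rs type): the buffers are left untouched
   /// and the slice is expressed by carrying an array offset and a length.
   struct Utf8View<'a> {
       offsets: &'a [i32], // full offsets buffer, starting at 0
       values: &'a str,    // full values buffer
       offset: usize,      // first logical slot of the slice
       len: usize,         // number of logical slots
   }

   impl<'a> Utf8View<'a> {
       /// Read logical slot `i` by shifting the lookup by the array offset.
       fn value(&self, i: usize) -> &'a str {
           assert!(i < self.len);
           let start = self.offsets[self.offset + i] as usize;
           let end = self.offsets[self.offset + i + 1] as usize;
           &self.values[start..end]
       }
   }

   /// Hypothetical helper: rebase a sliced offsets window so that it starts at 0.
   /// Returns the new offsets and the range of the values buffer they cover.
   fn normalize_offsets(offsets: &[i32]) -> (Vec<i32>, Range<usize>) {
       let first = offsets.first().copied().unwrap_or(0);
       let last = offsets.last().copied().unwrap_or(0);
       let rebased = offsets.iter().map(|o| o - first).collect();
       (rebased, first as usize..last as usize)
   }

   /// Exercises both options on the strings "ab", "cde", "f", sliced to skip "ab".
   fn demo() {
       let values = "abcdef";
       let offsets = [0, 2, 5, 6];

       // Option 1: carry the array offset; no buffer is rewritten.
       let view = Utf8View { offsets: &offsets, values, offset: 1, len: 2 };
       assert_eq!(view.value(0), "cde");
       assert_eq!(view.value(1), "f");

       // Option 2: recalculate the offsets so that they start from 0.
       let (rebased, range) = normalize_offsets(&offsets[1..]);
       assert_eq!(rebased, vec![0, 3, 4]);
       assert_eq!(&values[range], "cdef");
   }
   ```
   
   Option 1 touches no data, whereas option 2 allocates and writes a new offsets buffer (and typically trims the values buffer to match), which is the recalculation cost mentioned above.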
   
   [0] https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org