Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/07 09:56:03 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request #1280: Skip zero-ing primitive nulls

tustvold opened a new pull request #1280:
URL: https://github.com/apache/arrow-rs/pull/1280


   # Which issue does this PR close?
   
   Closes #1279.
   
   # Rationale for this change
    
   A marginal performance increase, and to gauge the appetite for these sorts of optimizations. In particular, the obvious next question is: do we need to zero-initialize the buffers to start with? I'm not entirely sure what Rust's safety rules say about uninitialized memory for POD types. Are there invalid bit sequences for some types (e.g. f32)? If not, why does it matter?
   
   ```
   arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs                                                                             
                           time:   [24.527 us 24.556 us 24.588 us]
                           change: [-6.4746% -5.3769% -4.4004%] (p = 0.00 < 0.05)
                           Performance has improved.
   arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs                                                                             
                           time:   [32.846 us 32.858 us 32.875 us]
                           change: [-9.1480% -8.3967% -7.9605%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   
   # What changes are included in this PR?
   
   Alters `ScalarBuffer::pad_nulls` so that it no longer zeroes the source location after reading from it.
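   To illustrate the idea, here is a minimal sketch of a backwards-fill `pad_nulls` with a simplified, hypothetical signature (the real arrow-rs method operates on typed buffers and definition-level data, not a `bool` slice). The key point of this PR is that the source slot is left untouched rather than zeroed:
   
   ```rust
   /// Spread `values_read` packed values out to their final positions,
   /// as indicated by `valid`. Walking backwards means each packed value
   /// is moved before its slot is overwritten. Source slots are NOT
   /// zeroed afterwards, so null positions hold arbitrary leftover data.
   fn pad_nulls(values: &mut [i32], values_read: usize, valid: &[bool]) {
       let mut value_pos = values_read; // one past the last packed value
       for level_pos in (0..valid.len()).rev() {
           if valid[level_pos] {
               value_pos -= 1;
               // move the value; deliberately skip zeroing values[value_pos]
               values[level_pos] = values[value_pos];
           }
       }
   }
   ```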
   
   # Are there any user-facing changes?
   
   Buffers that previously contained zeros in null positions will now contain arbitrary values.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on pull request #1280: Skip zero-ing primitive nulls

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1280:
URL: https://github.com/apache/arrow-rs/pull/1280#issuecomment-1032410249


   > This is [what Spark does](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L228) at the moment.
   
   To be honest, I'm surprised that's more performant (if it even is), as it puts a branch inside the main body of the loop... Something to measure for sure :smile: 
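   For reference, the pattern being discussed looks roughly like the following sketch (names are illustrative, not the Spark or arrow-rs API): a per-value decode loop where the null check sits in the hot path, with `read_value` standing in for whatever pulls the next non-null value off the page:
   
   ```rust
   /// Decode one value per definition level, branching on null-ness for
   /// every value. The `else` arm is the per-null zeroing in question.
   fn decode_per_value(
       def_levels: &[i16],
       max_def_level: i16,
       mut read_value: impl FnMut() -> i32,
       out: &mut Vec<i32>,
   ) {
       for &level in def_levels {
           if level == max_def_level {
               out.push(read_value()); // non-null: consume the next value
           } else {
               out.push(0); // null: write a zero placeholder
           }
       }
   }
   ```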





[GitHub] [arrow-rs] alamb merged pull request #1280: Skip zero-ing primitive nulls

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1280:
URL: https://github.com/apache/arrow-rs/pull/1280


   





[GitHub] [arrow-rs] sunchao commented on pull request #1280: Skip zero-ing primitive nulls

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #1280:
URL: https://github.com/apache/arrow-rs/pull/1280#issuecomment-1032875597


   Well, it's the same branch for decoding levels; the difference is that we don't have to store all the decoded definition levels in a vector first and then start decoding values. 
   
   Imagine we have 100 values to decode: the first 50 are null and the rest are not null (so the max definition level is 1). We can:
   1. read the next RLE-encoded run from the definition level buffer, which is 0
   2. (assuming the validity buffer is zero-initialized) simply increment the output offset by 50
   3. read the next RLE-encoded run from the definition level buffer, which is 1
   4. read 50 values from the value buffer using the batch API, and also set the validity bits
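   A sketch of this run-based scheme under the stated assumptions (the `RleRun` type and `decode_runs` names are illustrative, not the actual parquet/arrow-rs API): each RLE run of definition levels is handled as a batch, with non-null runs batch-copying values and null runs merely advancing the offset because the validity buffer starts zeroed:
   
   ```rust
   /// One RLE run of definition levels: `count` repeats of `def_level`.
   struct RleRun {
       def_level: i16,
       count: usize,
   }
   
   /// Decode whole runs at a time. Non-null runs copy values in bulk and
   /// set validity bits; null runs just skip ahead, relying on `validity`
   /// being zero-initialized.
   fn decode_runs(
       runs: &[RleRun],
       max_def_level: i16,
       values: &[i32],
       out: &mut [i32],
       validity: &mut [bool],
   ) {
       let (mut out_pos, mut val_pos) = (0, 0);
       for run in runs {
           if run.def_level == max_def_level {
               // non-null run: batch-copy the values and mark them valid
               out[out_pos..out_pos + run.count]
                   .copy_from_slice(&values[val_pos..val_pos + run.count]);
               validity[out_pos..out_pos + run.count].fill(true);
               val_pos += run.count;
           }
           // null run: validity is already zeroed, so just advance
           out_pos += run.count;
       }
   }
   ```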

