You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/25 10:14:38 UTC

[GitHub] [arrow-rs] ritchie46 opened a new pull request #348: Reduce memory usage of concat (large)utf8

ritchie46 opened a new pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348


   partial solution for #347 
   
   This PR precomputes the needed length of the buffers that store the raw string data. This way we only allocate the minimal needed memory and it is faster because there will be no reallocation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-847754603


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#348](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (44d325f) into [master](https://codecov.io/gh/apache/arrow-rs/commit/5ac771a6831c243e4fd8560b9cb5e13f8a81af34?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5ac771a) will **increase** coverage by `0.00%`.
   > The diff coverage is `90.32%`.
   
   > :exclamation: Current head 44d325f differs from pull request most recent head 9b30c2b. Consider uploading reports for the commit 9b30c2b to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/348/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #348   +/-   ##
   =======================================
     Coverage   82.56%   82.56%           
   =======================================
     Files         162      162           
     Lines       44063    44093   +30     
   =======================================
   + Hits        36379    36405   +26     
   - Misses       7684     7688    +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/array\_boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2Jvb2xlYW4ucnM=) | `90.90% <ø> (ø)` | |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `88.87% <87.50%> (-0.07%)` | :arrow_down: |
   | [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [5ac771a...9b30c2b](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-847754603


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#348](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (257bc59) into [master](https://codecov.io/gh/apache/arrow-rs/commit/5ac771a6831c243e4fd8560b9cb5e13f8a81af34?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5ac771a) will **increase** coverage by `0.05%`.
   > The diff coverage is `99.30%`.
   
   > :exclamation: Current head 257bc59 differs from pull request most recent head b3bfb5d. Consider uploading reports for the commit b3bfb5d to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/348/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #348      +/-   ##
   ==========================================
   + Coverage   82.56%   82.61%   +0.05%     
   ==========================================
     Files         162      162              
     Lines       44063    44197     +134     
   ==========================================
   + Hits        36379    36512     +133     
   - Misses       7684     7685       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/array\_boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2Jvb2xlYW4ucnM=) | `90.90% <ø> (ø)` | |
   | [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `89.91% <99.07%> (+1.54%)` | :arrow_up: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `89.18% <100.00%> (+0.24%)` | :arrow_up: |
   | [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `98.11% <100.00%> (+0.03%)` | :arrow_up: |
   | [parquet/src/column/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3dyaXRlci5ycw==) | `93.29% <100.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9ib29sZWFuLnJz) | `84.61% <0.00%> (+7.69%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [5ac771a...b3bfb5d](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao edited a comment on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850896662


   I did not have the time to review this, but I think that this will over-allocate in all cases where `MutableArrayData`'s API is used other than concat.
   
   In particular, this will overallocate in:
   
   * `filter` arrays, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
   * `zip` kernel: we now allocate `falsy.len() + truthy.len()`, even though we will only pick either from one or the other.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-848286923


   The MIRI failure is unrelated to this PR: https://github.com/apache/arrow-rs/issues/345


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850812404


   Thanks @ritchie46 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] ritchie46 commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-848502242


   > @ritchie46 do we have some (micro) benchmark results, like:
   
   ```
   Gnuplot not found, using plotters backend
   concat str 10k          time:   [162.54 us 162.73 us 163.00 us]                           
                           change: [-3.7750% -3.6483% -3.5084%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   concat str 50% nulls 10k                                                                            
                           time:   [619.71 us 621.16 us 623.28 us]
                           change: [-1.1095% -0.9350% -0.7141%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 10 outliers among 100 measurements (10.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     7 (7.00%) high severe
   
   concat str 5% nulls 10k time:   [391.61 us 391.71 us 391.81 us]                                    
                           change: [-1.3337% -1.2446% -1.1878%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     6 (6.00%) high mild
     3 (3.00%) high severe
   ```
   
   Performance wise it doesn't matter/ or hurt that much. So it mostly is more memory efficient. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] ritchie46 commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850943703


   I did not realize this @jorgecarleitao. What do you think about creating a builder pattern (with an prealloc option) for the concat kernel case and return the `new` implementation to the old state?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850896662


   I did not have the time to review this, but I think that this will over-allocate in all cases where `MutableArrayData`'s API is used.
   
   In particular, this will overallocate in:
   
   * `filter` arrays, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
   * `zip` kernel: we now allocate `falsy.len() + truthy.len()`, even though we will only pick either from one or the other.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao edited a comment on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850896662


   I did not have the time to review this, but I think that this will over-allocate in all cases where `MutableArrayData`'s API is used other than concat.
   
   In particular, IMO this overallocate in:
   
   * `filter` arrays, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
   * `zip` kernel: we now allocate `falsy.len() + truthy.len()`, even though we will only pick either from one or the other.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] Dandandan commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-848290785


   @ritchie46 do we have some (micro) benchmark results, like:
   
   ```cargo bench --bench concatenate_kernel -- "concat str"```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] ritchie46 commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850330839


   > Did some tests locally with 1025 items in the array, getting a small improvement in timing there too (3-5%).
   
   Nice!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #348: Reduce memory usage of concat (large)utf8

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850812372


   MIRI test failure is not related to this PR (we subsequently disabled it until we can fix the errors)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org