You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/25 10:14:38 UTC
[GitHub] [arrow-rs] ritchie46 opened a new pull request #348: Reduce memory usage of concat (large)utf8
ritchie46 opened a new pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348
partial solution for #347
This PR precomputes the needed length of the buffers that store the raw string data. This way we only allocate the minimal needed memory and it is faster because there will be no reallocation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-847754603
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#348](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (44d325f) into [master](https://codecov.io/gh/apache/arrow-rs/commit/5ac771a6831c243e4fd8560b9cb5e13f8a81af34?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5ac771a) will **increase** coverage by `0.00%`.
> The diff coverage is `90.32%`.
> :exclamation: Current head 44d325f differs from pull request most recent head 9b30c2b. Consider uploading reports for the commit 9b30c2b to get more accurate results
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/348/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #348 +/- ##
=======================================
Coverage 82.56% 82.56%
=======================================
Files 162 162
Lines 44063 44093 +30
=======================================
+ Hits 36379 36405 +26
- Misses 7684 7688 +4
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [arrow/src/array/array\_boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2Jvb2xlYW4ucnM=) | `90.90% <ø> (ø)` | |
| [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `88.87% <87.50%> (-0.07%)` | :arrow_down: |
| [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
| [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [5ac771a...9b30c2b](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-847754603
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#348](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (257bc59) into [master](https://codecov.io/gh/apache/arrow-rs/commit/5ac771a6831c243e4fd8560b9cb5e13f8a81af34?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (5ac771a) will **increase** coverage by `0.05%`.
> The diff coverage is `99.30%`.
> :exclamation: Current head 257bc59 differs from pull request most recent head b3bfb5d. Consider uploading reports for the commit b3bfb5d to get more accurate results
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/348/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #348 +/- ##
==========================================
+ Coverage 82.56% 82.61% +0.05%
==========================================
Files 162 162
Lines 44063 44197 +134
==========================================
+ Hits 36379 36512 +133
- Misses 7684 7685 +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [arrow/src/array/array\_boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2Jvb2xlYW4ucnM=) | `90.90% <ø> (ø)` | |
| [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `89.91% <99.07%> (+1.54%)` | :arrow_up: |
| [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `89.18% <100.00%> (+0.24%)` | :arrow_up: |
| [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
| [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `98.11% <100.00%> (+0.03%)` | :arrow_up: |
| [parquet/src/column/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3dyaXRlci5ycw==) | `93.29% <100.00%> (ø)` | |
| [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
| [arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/348/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9ib29sZWFuLnJz) | `84.61% <0.00%> (+7.69%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [5ac771a...b3bfb5d](https://codecov.io/gh/apache/arrow-rs/pull/348?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] jorgecarleitao edited a comment on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850896662
I did not have the time to review this, but I think that this will over-allocate in all cases where `MutableArrayData`'s API is used other than concat.
In particular, this will overallocate in:
* `filter` arrays, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
* `zip` kernel: we now allocate `falsy.len() + truthy.len()`, even though we will only pick either from one or the other.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb merged pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-848286923
The MIRI failure is unrelated to this PR: https://github.com/apache/arrow-rs/issues/345
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850812404
Thanks @ritchie46
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] ritchie46 commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-848502242
> @ritchie46 do we have some (micro) benchmark results, like:
```
Gnuplot not found, using plotters backend
concat str 10k time: [162.54 us 162.73 us 163.00 us]
change: [-3.7750% -3.6483% -3.5084%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
concat str 50% nulls 10k
time: [619.71 us 621.16 us 623.28 us]
change: [-1.1095% -0.9350% -0.7141%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
2 (2.00%) high mild
7 (7.00%) high severe
concat str 5% nulls 10k time: [391.61 us 391.71 us 391.81 us]
change: [-1.3337% -1.2446% -1.1878%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
6 (6.00%) high mild
3 (3.00%) high severe
```
Performance wise it doesn't matter/ or hurt that much. So it mostly is more memory efficient.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] ritchie46 commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850943703
I did not realize this @jorgecarleitao. What do you think about creating a builder pattern (with an prealloc option) for the concat kernel case and return the `new` implementation to the old state?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] jorgecarleitao commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850896662
I did not have the time to review this, but I think that this will over-allocate in all cases where `MutableArrayData`'s API is used.
In particular, this will overallocate in:
* `filter` arrays, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
* `zip` kernel: we now allocate `falsy.len() + truthy.len()`, even though we will only pick either from one or the other.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] jorgecarleitao edited a comment on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850896662
I did not have the time to review this, but I think that this will over-allocate in all cases where `MutableArrayData`'s API is used other than concat.
In particular, IMO this overallocate in:
* `filter` arrays, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
* `zip` kernel: we now allocate `falsy.len() + truthy.len()`, even though we will only pick either from one or the other.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Dandandan commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-848290785
@ritchie46 do we have some (micro) benchmark results, like:
```cargo bench --bench concatenate_kernel -- "concat str"```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] ritchie46 commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850330839
> Did some tests locally with 1025 items in the array, getting a small improvement in timing there too (3-5%).
Nice!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #348: Reduce memory usage of concat (large)utf8
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #348:
URL: https://github.com/apache/arrow-rs/pull/348#issuecomment-850812372
MIRI test failure is not related to this PR (we subsequently disabled it until we can fix the errors)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org