You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "raghavgautam (via GitHub)" <gi...@apache.org> on 2023/04/06 19:30:50 UTC
[GitHub] [pinot] raghavgautam opened a new pull request, #10572: #performance
raghavgautam opened a new pull request, #10572:
URL: https://github.com/apache/pinot/pull/10572
This patch addresses memory problem as discussed in https://github.com/apache/pinot/issues/10571
It uses rocksdb as an on disk key-value store instead of ConcurrentHashMap to reduce memory requirement
The memory requirement is number_of_keys * 10 bits + 64MB for each column family. So, for 1B keys it will use <1.5GB RAM.
Source:
- https://www.bookstack.cn/read/rocksdb-en/5dd063f6b6ed224d.md
- https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#indexes-and-filter-blocks
For testing, I was able to ingest at 8-10K/sec, with 265 million records already loaded on my M1 max with 10 cores and 64 GB RAM.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] codecov-commenter commented on pull request #10572: Use disk based key value store for deduplication
Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #10572:
URL: https://github.com/apache/pinot/pull/10572#issuecomment-1499558867
## [Codecov](https://codecov.io/gh/apache/pinot/pull/10572?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#10572](https://codecov.io/gh/apache/pinot/pull/10572?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d03e5e5) into [master](https://codecov.io/gh/apache/pinot/commit/f6c6d14540500867a167859a05e6822b60ce25c2?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f6c6d14) will **decrease** coverage by `50.39%`.
> The diff coverage is `0.00%`.
```diff
@@ Coverage Diff @@
## master #10572 +/- ##
=============================================
- Coverage 64.27% 13.89% -50.39%
+ Complexity 6441 439 -6002
=============================================
Files 2049 2050 +1
Lines 110287 110342 +55
Branches 16686 16688 +2
=============================================
- Hits 70887 15330 -55557
- Misses 34257 93758 +59501
+ Partials 5143 1254 -3889
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests1 | `?` | |
| unittests2 | `13.89% <0.00%> (+0.01%)` | :arrow_up: |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/pinot/pull/10572?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [.../pinot/segment/local/dedup/LocalKeyValueStore.java](https://codecov.io/gh/apache/pinot/pull/10572?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9kZWR1cC9Mb2NhbEtleVZhbHVlU3RvcmUuamF2YQ==) | `0.00% <0.00%> (ø)` | |
| [...ent/local/dedup/PartitionDedupMetadataManager.java](https://codecov.io/gh/apache/pinot/pull/10572?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9kZWR1cC9QYXJ0aXRpb25EZWR1cE1ldGFkYXRhTWFuYWdlci5qYXZh) | `0.00% <0.00%> (-66.00%)` | :arrow_down: |
... and [1409 files with indirect coverage changes](https://codecov.io/gh/apache/pinot/pull/10572/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
:mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] yupeng9 commented on pull request #10572: Use disk based key value store for deduplication
Posted by "yupeng9 (via GitHub)" <gi...@apache.org>.
yupeng9 commented on PR #10572:
URL: https://github.com/apache/pinot/pull/10572#issuecomment-1504340869
> Currently the deduplication is handled using the same way as upsert as a short term solution (not production ready). We have done a lot of bugfixes to the upsert implementation, but not actively maintain the dedup implementation.
>
> My suggestion would be to redesign dedup from scratch since it is not the same as upsert (no need to maintain valid docs, no need to track segment etc.), and TTL (dedup window) should be a must have for dedup. If we have proper TTL, the key size should be much smaller.
>
> After that if we still need disk based KV store we can introduce that as a plug-in. We don't want to introduce RocksDB dependency in default distribution
I think the redesign of dedup makes sense but it'll be a much larger and involved effort, and unlikely for @raghavgautam to take. I think it makes sense to start a separate effort of dedup V2, but in the meantime it's fine for the community to add features and make improvements over V1, assuming it'll take some time for V2 to be ready.
For depedency on KV store, I left a similar comment that we'd better move this to a plugin for simpler dependencies of the core.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on pull request #10572: Use disk based key value store for deduplication
Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on PR #10572:
URL: https://github.com/apache/pinot/pull/10572#issuecomment-1500737043
Currently the deduplication is handled using the same way as upsert as a short term solution (not production ready). We have done a lot of bugfixes to the upsert implementation, but not actively maintain the dedup implementation.
My suggestion would be to redesign dedup from scratch since it is not the same as upsert (no need to maintain valid docs, no need to track segment etc.), and TTL (dedup window) should be a must have for dedup. If we have proper TTL, the key size should be much smaller.
After that if we still need disk based KV store we can introduce that as a plug-in. We don't want to introduce RocksDB dependency in default distribution
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] yupeng9 commented on pull request #10572: Use disk based key value store for deduplication
Posted by "yupeng9 (via GitHub)" <gi...@apache.org>.
yupeng9 commented on PR #10572:
URL: https://github.com/apache/pinot/pull/10572#issuecomment-1505685210
> > I think the redesign of dedup makes sense but it'll be a much larger and involved effort, and unlikely for @raghavgautam to take. I think it makes sense to start a separate effort of dedup V2, but in the meantime it's fine for the community to add features and make improvements over V1, assuming it'll take some time for V2 to be ready.
>
> My concern for the current dedup implementation is that it is not maintained properly (not production ready), and the bugfixes for upsert is not added to dedup. If we decide to keep V1, we want to first apply all the changes to upsert to dedup as well.
This sounds like a bigger problem that dedup is not ready for the community but we released it. I don't see any warning in [Pinot doc](https://docs.pinot.apache.org/basics/data-import/dedup)...
I think we either maintain V1, or not advocate its use (e.g. mark it as experimental, or remove it). @kishoreg @icefury71 wdyt?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] yupeng9 commented on pull request #10572: Use disk based key value store for deduplication
Posted by "yupeng9 (via GitHub)" <gi...@apache.org>.
yupeng9 commented on PR #10572:
URL: https://github.com/apache/pinot/pull/10572#issuecomment-1500476636
It'll be good to move this rockdb dependency to a plugin subfolder so Pinot core does not depend on this lib
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org
[GitHub] [pinot] Jackie-Jiang commented on pull request #10572: Use disk based key value store for deduplication
Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on PR #10572:
URL: https://github.com/apache/pinot/pull/10572#issuecomment-1505668507
> I think the redesign of dedup makes sense but it'll be a much larger and involved effort, and unlikely for @raghavgautam to take. I think it makes sense to start a separate effort of dedup V2, but in the meantime it's fine for the community to add features and make improvements over V1, assuming it'll take some time for V2 to be ready.
My concern for the current dedup implementation is that it is not maintained properly (not production ready), and the bugfixes for upsert is not added to dedup. If we decide to keep V1, we want to first apply all the changes to upsert to dedup as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org