Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2021/04/28 07:07:38 UTC

[GitHub] [incubator-pegasus] ZhongChaoqiang opened a new pull request #728: improve performance of count_data

ZhongChaoqiang opened a new pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728


   ### What problem does this PR solve? <!--add issue link with summary if exists-->
   When we precisely count the data in a large table, it can take minutes or even hours.
   
   ### What is changed and how does it work?
   We only need the number of records, so the server only has to send the count to the client, not the records themselves.
   In our test, this was about 10x faster than before.
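
The idea described above can be sketched roughly as follows. This is only an illustrative in-memory model, not the Pegasus implementation: `FakePartition`, `ScanResponse`, the `only_count` flag, and `count_data` as a Python function are all hypothetical names chosen for the example. It shows that when the server returns just a per-batch count, the client still gets the exact total while no key/value payload crosses the wire.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ScanResponse:
    """Hypothetical response for one scan batch."""
    kvs: List[Tuple[bytes, bytes]] = field(default_factory=list)
    kv_count: int = 0      # number of records covered by this batch
    next_index: int = -1   # -1 means the scan is finished

    def payload_bytes(self) -> int:
        # Rough proxy for the key/value data sent over the network.
        return sum(len(k) + len(v) for k, v in self.kvs)


class FakePartition:
    """In-memory stand-in for one partition (an assumption, not the real server)."""

    def __init__(self, data: Dict[bytes, bytes]):
        self._items = sorted(data.items())

    def scan(self, start: int, batch_size: int, only_count: bool) -> ScanResponse:
        batch = self._items[start:start + batch_size]
        end = start + len(batch)
        next_index = end if end < len(self._items) else -1
        if only_count:
            # Count-only mode: report how many records matched, send no data.
            return ScanResponse(kv_count=len(batch), next_index=next_index)
        return ScanResponse(kvs=batch, kv_count=len(batch), next_index=next_index)


def count_data(partition: FakePartition, batch_size: int, only_count: bool) -> Tuple[int, int]:
    """Return (total records, total payload bytes transferred)."""
    total, transferred, index = 0, 0, 0
    while index != -1:
        resp = partition.scan(index, batch_size, only_count)
        total += resp.kv_count
        transferred += resp.payload_bytes()
        index = resp.next_index
    return total, transferred
```

Calling `count_data(partition, 100, only_count=True)` and `count_data(partition, 100, only_count=False)` on the same partition returns the same total, but the count-only call reports zero payload bytes. The real speedup depends on value sizes and network cost, which this model does not measure.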
   
   ##### Tests <!-- At least one of them must be included. -->
   - Unit test
   - Manual test (add detailed scripts or steps below)
   1. Create a table with millions of records.
   2. Run "count_data -c -f".
   
   ##### Related changes
   - Need to update the documentation
   - Need to be included in the release note


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org

[GitHub] [incubator-pegasus] levy5307 edited a comment on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

levy5307 edited a comment on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-838074458







[GitHub] [incubator-pegasus] Shuo-Jia commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

Shuo-Jia commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-845729024


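   > > > > > @Shuo-Jia @levy5307
   > > > > > Thanks for your review.
   > > > > > In our scenario, precisely counting data is a frequent operation after we bulk load SST files. The table holds a large amount of data, so the count often takes a long time.
   > > > > > So better performance for precise counting would be very helpful for us.
   > > > > > Refactoring scan is a better idea. Should I move the code into a separate RPC, or close this PR?
   > > > > 
   > > > > 
   > > > > Yes, I think it's good to open a new pull request to add a new RPC. @ZhongChaoqiang
   > > > 
   > > > 
   > > > @levy5307 @Shuo-Jia
   > > > Besides improving the performance of count_data, we built this feature for another scenario: quickly querying the number of KV pairs that match a given scan condition (for example, a scan with a specified range or a prefix condition).
   > > > If we use a separate RPC, it feels like too much functionality would be duplicated from scan. So could you take another look at whether keeping this in the existing scan feature would be more suitable? The code would be much simpler that way. Thanks!
   > > 
   > > 
   > > The duplicated code could be extracted and reused. Would that work? @ZhongChaoqiang
   > 
   > Code duplication is not the main issue. We use several scan interfaces, such as get_scanner/async_get_scanner/get_unordered_scanners/async_get_unordered_scanners. If count does not go through the scan path, we would probably have to add several count interfaces to mirror the existing scan ones, because the query conditions of the count interfaces must stay consistent with those of the scan interfaces. Doesn't that seem more complicated? @Shuo-Jia
   
   I don't quite follow. Scan would be one RPC and count another, with count and scan sharing the same "iterator"; the scan client interfaces can stay exactly as they are. Wouldn't it be enough to just add a count API?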
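
A minimal sketch of the "two RPCs, one shared iterator" arrangement suggested in the reply above, using a toy in-memory replica. Everything here is an assumption for illustration: `FakeReplica`, `_iterate`, `on_count`, and the `start_key`/`stop_key`/`prefix` parameters are hypothetical, and the `on_scan` shown is a simplified stand-in rather than the real Pegasus handler.

```python
from typing import Dict, Iterator, List, Optional, Tuple

KV = Tuple[bytes, bytes]


class FakeReplica:
    """Toy replica holding sorted key/value pairs (not the real server)."""

    def __init__(self, data: Dict[bytes, bytes]):
        self._items: List[KV] = sorted(data.items())

    def _iterate(
        self,
        start_key: bytes = b"",
        stop_key: Optional[bytes] = None,
        prefix: bytes = b"",
    ) -> Iterator[KV]:
        """Shared iterator: range and prefix filtering live in one place."""
        for key, value in self._items:
            if key < start_key:
                continue
            if stop_key is not None and key >= stop_key:
                break  # keys are sorted, so nothing later can match
            if key.startswith(prefix):
                yield key, value

    def on_scan(self, **conditions) -> List[KV]:
        """Scan handler: returns the matching records."""
        return list(self._iterate(**conditions))

    def on_count(self, **conditions) -> int:
        """Hypothetical count handler: same conditions, returns only a number."""
        return sum(1 for _ in self._iterate(**conditions))
```

With this layout the scan handler's signature is untouched, and any condition added to `_iterate` is automatically available to both handlers. Whether this maps cleanly onto the several client-side scanner interfaces is the open question in the thread.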



[GitHub] [incubator-pegasus] Shuo-Jia commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

Shuo-Jia commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-837993349


   @ZhongChaoqiang This is a good idea. We have discussed it and think that:
   1. Adding code to `on_scan` will make `on_scan` more bloated; we are now planning to refactor the `scan` logic.
   2. Using the `scan` RPC to count data does not seem like an elegant design. Would it be better to add a new RPC and only reuse the code of `on_scan` (which needs the refactoring in `1`)?
   3. The C++ shell will be abandoned; we suggest adding the client side to https://github.com/pegasus-kv/admin-cli



[GitHub] [incubator-pegasus] ZhongChaoqiang commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

ZhongChaoqiang commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-845716066


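   > > > > @Shuo-Jia @levy5307
   > > > > Thanks for your review.
   > > > > In our scenario, precisely counting data is a frequent operation after we bulk load SST files. The table holds a large amount of data, so the count often takes a long time.
   > > > > So better performance for precise counting would be very helpful for us.
   > > > > Refactoring scan is a better idea. Should I move the code into a separate RPC, or close this PR?
   > > > 
   > > > 
   > > > Yes, I think it's good to open a new pull request to add a new RPC. @ZhongChaoqiang
   > > 
   > > 
   > > @levy5307 @Shuo-Jia
   > > Besides improving the performance of count_data, we built this feature for another scenario: quickly querying the number of KV pairs that match a given scan condition (for example, a scan with a specified range or a prefix condition).
   > > If we use a separate RPC, it feels like too much functionality would be duplicated from scan. So could you take another look at whether keeping this in the existing scan feature would be more suitable? The code would be much simpler that way. Thanks!
   > 
   > The duplicated code could be extracted and reused. Would that work? @ZhongChaoqiang
   
   Code duplication is not the main issue. We use several scan interfaces, such as get_scanner/async_get_scanner/get_unordered_scanners/async_get_unordered_scanners. If count does not go through the scan path, we would probably have to add several count interfaces to mirror the existing scan ones, because the query conditions of the count interfaces must stay consistent with those of the scan interfaces. Doesn't that seem more complicated? @Shuo-Jia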
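
To make the consistency concern above concrete, here is one possible way (only a sketch, not something proposed in this PR) to keep count and scan query conditions aligned on the client side: both calls accept the same options object. `ScanOptions`, `FakeClient`, and `count` are hypothetical names; `get_unordered_scanners` is borrowed from the comment above but is modelled here as a simple in-memory function, not the real client API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ScanOptions:
    """Hypothetical query conditions shared by scan and count calls."""
    start_key: bytes = b""
    stop_key: Optional[bytes] = None
    prefix: bytes = b""

    def matches(self, key: bytes) -> bool:
        if key < self.start_key:
            return False
        if self.stop_key is not None and key >= self.stop_key:
            return False
        return key.startswith(self.prefix)


class FakeClient:
    """Stand-in client over several in-memory partitions."""

    def __init__(self, partitions: List[Dict[bytes, bytes]]):
        self._partitions = partitions

    def get_unordered_scanners(self, options: ScanOptions) -> List[List[Tuple[bytes, bytes]]]:
        # One result list per partition, loosely mirroring per-partition scanners.
        return [
            sorted((k, v) for k, v in p.items() if options.matches(k))
            for p in self._partitions
        ]

    def count(self, options: ScanOptions) -> int:
        # Hypothetical single count entry point that takes the same options.
        return sum(1 for p in self._partitions for k in p if options.matches(k))
```

Because both methods read the same `ScanOptions`, a new filter only has to be added once. This does not settle whether one count entry point is enough to cover the ordered, unordered, and async scanner variants mentioned above.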



[GitHub] [incubator-pegasus] levy5307 commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

levy5307 commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-838074458


   > @ZhongChaoqiang This is a good idea. We have discussed it and think that:
   > 
   > 1. Adding code to `on_scan` will make `on_scan` more bloated; we are now planning to refactor the `scan` logic.
   > 2. Using the `scan` RPC to count data does not seem like an elegant design. Would it be better to add a new RPC and only reuse the code of `on_scan` (which needs the refactoring in `1`)?
   > 3. The C++ shell will be abandoned; we suggest adding the client side to https://github.com/pegasus-kv/admin-cli
   
   I agree with @Shuo-Jia. Besides, precisely counting data is not a common scenario. Count estimates are enough in most cases.



[GitHub] [incubator-pegasus] Shuo-Jia commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

Shuo-Jia commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-845691040


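   > > > @Shuo-Jia @levy5307
   > > > Thanks for your review.
   > > > In our scenario, precisely counting data is a frequent operation after we bulk load SST files. The table holds a large amount of data, so the count often takes a long time.
   > > > So better performance for precise counting would be very helpful for us.
   > > > Refactoring scan is a better idea. Should I move the code into a separate RPC, or close this PR?
   > > 
   > > 
   > > Yes, I think it's good to open a new pull request to add a new RPC. @ZhongChaoqiang
   > 
   > @levy5307 @Shuo-Jia
   > Besides improving the performance of count_data, we built this feature for another scenario: quickly querying the number of KV pairs that match a given scan condition (for example, a scan with a specified range or a prefix condition).
   > If we use a separate RPC, it feels like too much functionality would be duplicated from scan. So could you take another look at whether keeping this in the existing scan feature would be more suitable? The code would be much simpler that way. Thanks!
   
   The duplicated code could be extracted and reused. Would that work? @ZhongChaoqiang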



[GitHub] [incubator-pegasus] ZhongChaoqiang commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

ZhongChaoqiang commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-834035671


   @neverchanje @levy5307 
   Could you help review this? Thanks.



[GitHub] [incubator-pegasus] ZhongChaoqiang commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

ZhongChaoqiang commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-840342462


   @Shuo-Jia @levy5307 
   Thanks for your review.
   In our scenario, precisely counting data is a frequent operation after we bulk load SST files. The table holds a large amount of data, so the count often takes a long time.
   So better performance for precise counting would be very helpful for us.
   Refactoring scan is a better idea. Should I move the code into a separate RPC, or close this PR?



[GitHub] [incubator-pegasus] ZhongChaoqiang commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

ZhongChaoqiang commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-845056898


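   > > @Shuo-Jia @levy5307
   > > Thanks for your review.
   > > In our scenario, precisely counting data is a frequent operation after we bulk load SST files. The table holds a large amount of data, so the count often takes a long time.
   > > So better performance for precise counting would be very helpful for us.
   > > Refactoring scan is a better idea. Should I move the code into a separate RPC, or close this PR?
   > 
   > Yes, I think it's good to open a new pull request to add a new RPC. @ZhongChaoqiang
   
   @levy5307 @Shuo-Jia
   Besides improving the performance of count_data, we built this feature for another scenario: quickly querying the number of KV pairs that match a given scan condition (for example, a scan with a specified range or a prefix condition).
   If we use a separate RPC, it feels like too much functionality would be duplicated from scan. So could you take another look at whether keeping this in the existing scan feature would be more suitable? The code would be much simpler that way. Thanks!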



[GitHub] [incubator-pegasus] levy5307 commented on pull request #728: improve performance of count_data

Posted by GitBox <gi...@apache.org>.

levy5307 commented on pull request #728:
URL: https://github.com/apache/incubator-pegasus/pull/728#issuecomment-840375971


   > @Shuo-Jia @levy5307
   > Thanks for your review.
   > In our scenario, precisely counting data is a frequent operation after we bulk load SST files. The table holds a large amount of data, so the count often takes a long time.
   > So better performance for precise counting would be very helpful for us.
   > Refactoring scan is a better idea. Should I move the code into a separate RPC, or close this PR?
   
   Yes, I think it's good to open a new pull request to add a new RPC. @ZhongChaoqiang

