You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2022/08/01 08:23:55 UTC

[GitHub] [incubator-pegasus] GehaFearless opened a new pull request, #1091: improve performance of count_data

GehaFearless opened a new pull request, #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091

   ### What problem does this PR solve? 
   
   https://github.com/apache/incubator-pegasus/issues/1090
   
   When we precisely count data for a large table, it will cost minutes or hours. 
   However, it's unnecessarily return key-values from server to client. 
   
   ### What is changed and how does it work?
   Actually, we just need the count of data. So we just need transfer the count of data from server to client, but not the detailed data.
   When we need it, we can input "count_data -c -o" on pegasus_shell.
   
   ##### Tests <!-- At least one of them must be included. -->
   
   - Unit test
   - Manual test (add detailed scripts or steps below)
   
   ##### Related changes
   
   - Need to update the documentation
   - Need to be included in the release note
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r952110910


##########
src/client_lib/pegasus_client_impl.h:
##########
@@ -247,13 +247,21 @@ class pegasus_client_impl : public pegasus_client
 
     static void init_error();
 
+    enum class async_scan_type : char

Review Comment:
   Do not expose it to public in header, make it as a private of `class pegasus_scanner_impl` is enough?



##########
src/client_lib/pegasus_client_impl.h:
##########
@@ -247,13 +247,21 @@ class pegasus_client_impl : public pegasus_client
 
     static void init_error();
 
+    enum class async_scan_type : char
+    {
+        NORMAL,
+        COUNT_ONLY,
+        COUNT_ONLY_FINISHED
+    };
+
     class pegasus_scanner_impl : public pegasus_scanner
     {
     public:
         int next(std::string &hashkey,
                  std::string &sortkey,
                  std::string &value,
-                 internal_info *info = nullptr) override;
+                 internal_info *info = nullptr,
+                 int32_t *count = nullptr) override;

Review Comment:
   hashkey-sortkey-value and count will not be returned at the same time? If not, separete it into two functions would be better. one is:
   ```
           int next(std::string &hashkey,
                    std::string &sortkey,
                    std::string &value,
                    internal_info *info = nullptr)
   ```
   the other is 
   ```
           int next(int32_t &count,
                    internal_info *info = nullptr)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r946320530


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1221,6 +1222,10 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
 
         it->Next();
     }
+    if (request.only_return_count) {
+        resp.kvs.emplace_back(::dsn::apps::key_value());

Review Comment:
   I did it by adding a member variable _already_add_count



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r947421269


##########
src/shell/commands/data_operations.cpp:
##########
@@ -2397,6 +2398,15 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         options.no_value = false;
     else
         options.no_value = true;
+
+    if (diff_hash_key || stat_size || value_filter_type != pegasus::pegasus_client::FT_NO_FILTER ||
+        sort_key_filter_type == pegasus::pegasus_client::FT_MATCH_EXACT) {
+        options.only_return_count = false;

Review Comment:
   The jugement is complex, add comments to explain why.



##########
src/test/function_test/test_scan.cpp:
##########
@@ -168,6 +168,34 @@ class scan : public testing::Test
     }
 };
 
+TEST_F(scan, OVERALL_COUNT_ONLY)
+{
+    ddebug("TEST OVERALL_SCAN_COUNT_ONLY...");
+    pegasus_client::scan_options options;
+    options.only_return_count = true;
+    std::vector<pegasus_client::pegasus_scanner *> scanners;
+    int ret = client->get_unordered_scanners(3, options, scanners);
+    ASSERT_EQ(0, ret) << "Error occurred when getting scanner. error="
+                      << client->get_error_string(ret);
+    ASSERT_LE(scanners.size(), 3);
+
+    std::string hash_key;
+    std::string sort_key;
+    std::string value;
+    int32_t data_counts = 0;
+    for (auto scanner : scanners) {
+        ASSERT_NE(nullptr, scanner);
+        int32_t kv_count;
+        while (PERR_OK == (ret = (scanner->next(hash_key, sort_key, value, nullptr, &kv_count)))) {
+            data_counts += kv_count;
+        }
+        ASSERT_EQ(PERR_SCAN_COMPLETE, ret) << "Error occurred when scan. error="
+                                           << client->get_error_string(ret);
+        delete scanner;
+    }
+    ASSERT_EQ(10990, data_counts);

Review Comment:
   where dose 10990 come from? it would be better to use a variable from previous variables, do not use a magic number.



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -261,6 +278,18 @@ void pegasus_client_impl::pegasus_scanner_impl::_on_scan_response(::dsn::error_c
             _kvs = std::move(response.kvs);
             _p = -1;
             _context = response.context_id;
+            // 1. kv_count exist on response means server is newer version (added counting size only
+            // implementation)
+            //   1> kv_count == -1 indicates response still have key and value data
+            //   2> kv_count > -1 indicates response only have kv size count, but not key && value
+            // 2. kv_count is not existed means server is older version
+            if (response.__isset.kv_count) {
+                if (response.kv_count != -1) {
+                    _only_calculate_count = true;

Review Comment:
   Request is sent from client, we can init _only_calculate_count when send request, right?



##########
idl/rrdb.thrift:
##########
@@ -279,6 +279,7 @@ struct get_scanner_request
     11:optional bool    validate_partition_hash;
     12:optional bool    return_expire_ts;
     13:optional bool full_scan; // true means client want to build 'full scan' context with the server side, false otherwise
+    14:optional bool only_return_count;

Review Comment:
   set the default value to false?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1312,6 +1319,7 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
     resp.app_id = _gpid.get_app_id();
     resp.partition_index = _gpid.get_partition_index();
     resp.server = _primary_address;
+    bool only_return_count = false;

Review Comment:
   Define it in `if (context) {...}` scope is enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
empiredan commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r935080105


##########
src/shell/commands/data_operations.cpp:
##########
@@ -2397,6 +2401,15 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         options.no_value = false;
     else
         options.no_value = true;
+
+    if (diff_hash_key || stat_size || value_filter_type != pegasus::pegasus_client::FT_NO_FILTER ||
+        sort_key_filter_type != pegasus::pegasus_client::FT_NO_FILTER) {
+        options.only_return_count = false;
+    }

Review Comment:
   Otherwise, can `only_return_count` be set true automatically ?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1365,7 +1373,8 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
                                                    epoch_now,
                                                    no_value,
                                                    validate_hash,
-                                                   return_expire_ts);
+                                                   return_expire_ts,
+                                                   request.only_return_count ? false : true);

Review Comment:
   ```c++
   !request.only_return_count);
   ```



##########
src/shell/commands/data_operations.cpp:
##########
@@ -2397,6 +2401,15 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         options.no_value = false;
     else
         options.no_value = true;
+
+    if (diff_hash_key || stat_size || value_filter_type != pegasus::pegasus_client::FT_NO_FILTER ||

Review Comment:
   Is it possible to pass `value_filter_type` to the server side to filter value remotely ?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1449,7 +1463,10 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
         resp.error = rocksdb::Status::Code::kNotFound;
     }
 
-    _cu_calculator->add_scan_cu(req, resp.error, resp.kvs);
+    // abandon calculate capacity unit
+    if (request.only_return_count) {

Review Comment:
   ```c++
   if (!request.only_return_count)
   ```
   
   On the other hand, should `add_scan_cu` be ignored ? Since it will be count as 1 even if not found.
   ```c++
       if (status == rocksdb::Status::kNotFound) {
           add_read_cu(1);
           return;
       }
   ```



##########
src/shell/commands/data_operations.cpp:
##########
@@ -2169,6 +2169,7 @@ bool clear_data(command_executor *e, shell_context *sc, arguments args)
 bool count_data(command_executor *e, shell_context *sc, arguments args)
 {
     static struct option long_options[] = {{"precise", no_argument, 0, 'c'},
+                                           {"only_return_data_count", no_argument, 0, 'o'},

Review Comment:
   Do we need the option `only_return_data_count` ? Can it be decided accordingly ? 



##########
src/shell/command_helper.h:
##########
@@ -499,36 +501,40 @@ inline void scan_data_next(scan_data_context *context)
                             context->timeout_ms);
                         break;
                     case SCAN_COUNT:
-                        context->split_rows++;
-                        if (context->stat_size && context->statistics) {
-                            long hash_key_size = hash_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::HASH_KEY_SIZE),
-                                hash_key_size);
-
-                            long sort_key_size = sort_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::SORT_KEY_SIZE),
-                                sort_key_size);
-
-                            long value_size = value.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::VALUE_SIZE), value_size);
-
-                            long row_size = hash_key_size + sort_key_size + value_size;
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::ROW_SIZE), row_size);
-
-                            if (context->top_count > 0) {
-                                context->top_rows.push(
-                                    std::move(hash_key), std::move(sort_key), row_size);
+                        if (kv_count == -1) {

Review Comment:
   Does `kv_count == -1` means `only_return_count` ? Can `only_return_count` be used as the condition to be checked ?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -2316,24 +2332,28 @@ pegasus_server_impl::append_key_value_for_scan(std::vector<::dsn::apps::key_valu
             return range_iteration_state::kFiltered;
         }
     }
-    std::shared_ptr<char> key_buf(::dsn::utils::make_shared_array<char>(raw_key.length()));
-    ::memcpy(key_buf.get(), raw_key.data(), raw_key.length());
-    kv.key.assign(std::move(key_buf), 0, raw_key.length());
+    if (fill_value) {

Review Comment:
   ```c++
   if (!fill_value) {
       return range_iteration_state::kNormal;
   }
   ```



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1198,7 +1198,8 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
             epoch_now,
             request.no_value,
             request.__isset.validate_partition_hash ? request.validate_partition_hash : true,
-            return_expire_ts);
+            return_expire_ts,
+            request.only_return_count ? false : true);

Review Comment:
   ```c++
   !request.only_return_count);
   ```



##########
src/shell/command_helper.h:
##########
@@ -499,36 +501,40 @@ inline void scan_data_next(scan_data_context *context)
                             context->timeout_ms);
                         break;
                     case SCAN_COUNT:
-                        context->split_rows++;
-                        if (context->stat_size && context->statistics) {
-                            long hash_key_size = hash_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::HASH_KEY_SIZE),
-                                hash_key_size);
-
-                            long sort_key_size = sort_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::SORT_KEY_SIZE),
-                                sort_key_size);
-
-                            long value_size = value.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::VALUE_SIZE), value_size);
-
-                            long row_size = hash_key_size + sort_key_size + value_size;
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::ROW_SIZE), row_size);
-
-                            if (context->top_count > 0) {
-                                context->top_rows.push(
-                                    std::move(hash_key), std::move(sort_key), row_size);
+                        if (kv_count == -1) {
+                            context->split_rows++;
+                            if (context->stat_size && context->statistics) {
+                                long hash_key_size = hash_key.size();
+                                context->statistics->measureTime(
+                                    static_cast<uint32_t>(histogram_type::HASH_KEY_SIZE),
+                                    hash_key_size);
+
+                                long sort_key_size = sort_key.size();
+                                context->statistics->measureTime(
+                                    static_cast<uint32_t>(histogram_type::SORT_KEY_SIZE),
+                                    sort_key_size);
+
+                                long value_size = value.size();
+                                context->statistics->measureTime(
+                                    static_cast<uint32_t>(histogram_type::VALUE_SIZE), value_size);
+
+                                long row_size = hash_key_size + sort_key_size + value_size;
+                                context->statistics->measureTime(
+                                    static_cast<uint32_t>(histogram_type::ROW_SIZE), row_size);
+
+                                if (context->top_count > 0) {
+                                    context->top_rows.push(
+                                        std::move(hash_key), std::move(sort_key), row_size);
+                                }
                             }
-                        }
-                        if (context->count_hash_key) {
-                            if (hash_key != context->last_hash_key) {
-                                context->split_hash_key_count++;
-                                context->last_hash_key = std::move(hash_key);
+                            if (context->count_hash_key) {
+                                if (hash_key != context->last_hash_key) {
+                                    context->split_hash_key_count++;
+                                    context->last_hash_key = std::move(hash_key);
+                                }
                             }
+                        } else {

Review Comment:
   ```c++
   if (...) {
       context->split_rows += kv_count;
       scan_data_next(context);
       break;
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r941957204


##########
src/shell/command_helper.h:
##########
@@ -499,36 +501,40 @@ inline void scan_data_next(scan_data_context *context)
                             context->timeout_ms);
                         break;
                     case SCAN_COUNT:
-                        context->split_rows++;
-                        if (context->stat_size && context->statistics) {
-                            long hash_key_size = hash_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::HASH_KEY_SIZE),
-                                hash_key_size);
-
-                            long sort_key_size = sort_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::SORT_KEY_SIZE),
-                                sort_key_size);
-
-                            long value_size = value.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::VALUE_SIZE), value_size);
-
-                            long row_size = hash_key_size + sort_key_size + value_size;
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::ROW_SIZE), row_size);
-
-                            if (context->top_count > 0) {
-                                context->top_rows.push(
-                                    std::move(hash_key), std::move(sort_key), row_size);
+                        if (kv_count == -1) {

Review Comment:
   In order to cope with the different versions of the match, I think this is the best



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r942269032


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1198,7 +1198,8 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
             epoch_now,
             request.no_value,
             request.__isset.validate_partition_hash ? request.validate_partition_hash : true,
-            return_expire_ts);
+            return_expire_ts,
+            !request.only_return_count);

Review Comment:
   Every kv need to be processd by a filter from `append_key_value_for_scan`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r949835960


##########
src/test/function_test/test_scan.cpp:
##########
@@ -168,6 +168,38 @@ class scan : public testing::Test
     }
 };
 
+TEST_F(scan, OVERALL_COUNT_ONLY)
+{
+    ddebug("TEST OVERALL_SCAN_COUNT_ONLY...");
+    pegasus_client::scan_options options;
+    options.only_return_count = true;
+    std::vector<pegasus_client::pegasus_scanner *> scanners;
+    int ret = client->get_unordered_scanners(3, options, scanners);
+    ASSERT_EQ(0, ret) << "Error occurred when getting scanner. error="
+                      << client->get_error_string(ret);
+    ASSERT_LE(scanners.size(), 3);
+
+    std::string hash_key;
+    std::string sort_key;
+    std::string value;
+    int32_t data_counts = 0;
+    for (auto scanner : scanners) {
+        ASSERT_NE(nullptr, scanner);
+        int32_t kv_count;
+        while (PERR_OK == (ret = (scanner->next(hash_key, sort_key, value, nullptr, &kv_count)))) {
+            data_counts += kv_count;
+        }
+        ASSERT_EQ(PERR_SCAN_COMPLETE, ret) << "Error occurred when scan. error="
+                                           << client->get_error_string(ret);
+        delete scanner;
+    }
+    int data_sizes = 0;

Review Comment:
   better naming as xxx_count



##########
src/client_lib/pegasus_client_impl.h:
##########
@@ -247,13 +247,21 @@ class pegasus_client_impl : public pegasus_client
 
     static void init_error();
 
+    enum class async_scan_type : char
+    {
+        normal,

Review Comment:
   Both ALL_UPPER_CASE and kCamelCase naming are recommend for enum in Pegasus.



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -261,6 +278,17 @@ void pegasus_client_impl::pegasus_scanner_impl::_on_scan_response(::dsn::error_c
             _kvs = std::move(response.kvs);
             _p = -1;
             _context = response.context_id;
+            // 1. kv_count exist on response means server is newer version (added counting size only
+            // implementation)
+            //   1> kv_count == -1 indicates response still have key and value data
+            //   2> kv_count > -1 indicates response only have kv size count, but not key && value
+            // 2. kv_count is not existed means server is older version
+            if (response.__isset.kv_count) {
+                if (response.kv_count != -1) {

Review Comment:
   in which case that response.__isset.kv_count is set but response.kv_count == -1?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan merged pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
empiredan merged PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r937403751


##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -55,7 +55,35 @@ pegasus_client_impl::pegasus_scanner_impl::pegasus_scanner_impl(::dsn::apps::rrd
       _rpc_started(false),
       _validate_partition_hash(validate_partition_hash),
       _full_scan(full_scan)
+

Review Comment:
   remove the blank line



##########
idl/rrdb.thrift:
##########
@@ -279,11 +279,15 @@ struct get_scanner_request
     11:optional bool    validate_partition_hash;
     12:optional bool    return_expire_ts;
     13:optional bool full_scan; // true means client want to build 'full scan' context with the server side, false otherwise
+    14:optional filter_type value_filter_type;
+    15:optional dsn.blob    value_filter_pattern;

Review Comment:
   Are the two parameters related to this patch?



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -162,13 +194,15 @@ void pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()
         }
 
         // valid data got
-        std::string hash_key, sort_key;
-        pegasus_restore_key(_kvs[_p].key, hash_key, sort_key);
-        std::string value(_kvs[_p].value.data(), _kvs[_p].value.length());
-        uint32_t expire_ts_seconds = _kvs[_p].__isset.expire_ts_seconds
-                                         ? static_cast<uint32_t>(_kvs[_p].expire_ts_seconds)
-                                         : 0;
-
+        std::string hash_key, sort_key, value = "";

Review Comment:
   ```suggestion
           std::string hash_key, sort_key, value;
   ```



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -162,13 +194,15 @@ void pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()
         }
 
         // valid data got
-        std::string hash_key, sort_key;
-        pegasus_restore_key(_kvs[_p].key, hash_key, sort_key);
-        std::string value(_kvs[_p].value.data(), _kvs[_p].value.length());
-        uint32_t expire_ts_seconds = _kvs[_p].__isset.expire_ts_seconds
-                                         ? static_cast<uint32_t>(_kvs[_p].expire_ts_seconds)
-                                         : 0;
-
+        std::string hash_key, sort_key, value = "";
+        uint32_t expire_ts_seconds = 0;
+        if (_kv_count == -1) {

Review Comment:
   it suggest to add some comment to explain what dose `_kv_count == -1` mean



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1389,6 +1419,11 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
             it->Next();
         }
 
+        if (request.only_return_count) {

Review Comment:
   you can use the flag in pegasus_scan_context, right?



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -55,7 +55,35 @@ pegasus_client_impl::pegasus_scanner_impl::pegasus_scanner_impl(::dsn::apps::rrd
       _rpc_started(false),
       _validate_partition_hash(validate_partition_hash),
       _full_scan(full_scan)
+
+{
+}
+
+int pegasus_client_impl::pegasus_scanner_impl::next(std::string &hashkey,

Review Comment:
   why not reuse the exist `pegasus_client_impl::pegasus_scanner_impl::next` ?



##########
idl/rrdb.thrift:
##########
@@ -279,11 +279,15 @@ struct get_scanner_request
     11:optional bool    validate_partition_hash;
     12:optional bool    return_expire_ts;
     13:optional bool full_scan; // true means client want to build 'full scan' context with the server side, false otherwise
+    14:optional filter_type value_filter_type;
+    15:optional dsn.blob    value_filter_pattern;
+    16:optional bool    only_return_count = false;
 }
 
 struct scan_request
 {
     1:i64           context_id;
+    2:optional bool only_return_count;

Review Comment:
   the `get_scanner_request ` has a `only_return_count` field, why add another one to `scan_request`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r948701421


##########
src/shell/commands/data_operations.cpp:
##########
@@ -2333,6 +2333,7 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         tp.output(std::cout, tp_output_format::kTabular);
         return true;
     }
+    options.only_return_count = true;

Review Comment:
   It's better to change the position



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -261,6 +278,17 @@ void pegasus_client_impl::pegasus_scanner_impl::_on_scan_response(::dsn::error_c
             _kvs = std::move(response.kvs);
             _p = -1;
             _context = response.context_id;
+            // 1. kv_count exist on response means server is newer version (added counting size only
+            // implementation)
+            //   1> kv_count == -1 indicates response still have key and value data
+            //   2> kv_count > -1 indicates response only have kv size count, but not key && value
+            // 2. kv_count is not existed means server is older version
+            if (response.__isset.kv_count) {
+                if (response.kv_count != -1) {

Review Comment:
   case value_filter be used 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#issuecomment-1214601291

   > Could you give some performance comparation with the old version ?
   for my bench on onebox:
   1> before
   ![wecom-temp-d7e2a1e011357a0ca4c3ad44d910e69e](https://user-images.githubusercontent.com/48315319/184576502-a657228b-3038-4c34-bb29-c1ab5391bfd2.png)
   
   2> this commit
   <img width="813" alt="image" src="https://user-images.githubusercontent.com/48315319/184576461-15dfad35-dbef-4dea-b15b-5a99cf70f527.png">
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] empiredan commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
empiredan commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r956602425


##########
src/shell/commands/data_operations.cpp:
##########
@@ -2397,6 +2397,17 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         options.no_value = false;
     else
         options.no_value = true;
+
+    // cuz option only_return_count not return data(hashkey&sortkey&value) to client
+    // all of this options need check data on client side

Review Comment:
   I think maybe it's better to explain each of conditions here ?
   ```suggestion 
       // Decide whether real data should be returned to client. Once the real data is
       // decided not to be returned to client side: option `only_return_count` will be
       // used.
   ```



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -51,13 +51,38 @@ pegasus_client_impl::pegasus_scanner_impl::pegasus_scanner_impl(::dsn::apps::rrd
       _options(options),
       _splits_hash(std::move(hash)),
       _p(-1),
+      _kv_count(-1),
       _context(SCAN_CONTEXT_ID_COMPLETED),
       _rpc_started(false),
       _validate_partition_hash(validate_partition_hash),
-      _full_scan(full_scan)
+      _full_scan(full_scan),
+      _type(async_scan_type::NORMAL)
 {
 }
 
+int pegasus_client_impl::pegasus_scanner_impl::next(int32_t &count, internal_info *info)
+{
+    ::dsn::utils::notify_event op_completed;
+    int ret = -1;
+    auto callback = [&](int err,
+                        std::string &&hash,
+                        std::string &&sort,
+                        std::string &&val,
+                        internal_info &&ii,
+                        uint32_t expire_ts_seconds,
+                        int32_t kv_count) {
+        ret = err;
+        if (info) {
+            (*info) = std::move(ii);
+        }

Review Comment:
   ```suggestion
           if (info != nullptr) {
               *info = std::move(ii);
           }
   ```



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -261,6 +297,16 @@ void pegasus_client_impl::pegasus_scanner_impl::_on_scan_response(::dsn::error_c
             _kvs = std::move(response.kvs);
             _p = -1;
             _context = response.context_id;
+            // 1. kv_count exist on response mean (a && b):
+            //   a> server is newer version (added counting size only implementation)
+            //   b> response only have kv size count, but not key && value
+            // 2. kv_count is not existed means (a || b):
+            //   a> server is older version
+            //   b> response still have key and value data

Review Comment:
   I think there is no need to state "otherwise" here which is just the negation of the existence of `kv_count`.
   ```suggestion
               // If `kv_count` exists in response, then:
               //   1) server side supports only counting size, and
               //   2) `kvs` in response must be empty
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r938599524


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -2316,24 +2332,28 @@ pegasus_server_impl::append_key_value_for_scan(std::vector<::dsn::apps::key_valu
             return range_iteration_state::kFiltered;
         }
     }
-    std::shared_ptr<char> key_buf(::dsn::utils::make_shared_array<char>(raw_key.length()));
-    ::memcpy(key_buf.get(), raw_key.data(), raw_key.length());
-    kv.key.assign(std::move(key_buf), 0, raw_key.length());
+    if (fill_value) {

Review Comment:
   `return range_iteration_state::kNormal` All branches use it, So I think it is unnecessary



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r942264440


##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -162,13 +168,16 @@ void pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()
         }
 
         // valid data got
-        std::string hash_key, sort_key;
-        pegasus_restore_key(_kvs[_p].key, hash_key, sort_key);
-        std::string value(_kvs[_p].value.data(), _kvs[_p].value.length());
-        uint32_t expire_ts_seconds = _kvs[_p].__isset.expire_ts_seconds
-                                         ? static_cast<uint32_t>(_kvs[_p].expire_ts_seconds)
-                                         : 0;
-
+        std::string hash_key, sort_key, value;
+        uint32_t expire_ts_seconds = 0;
+        // _kv_count == -1 means req just want to get data counts, not include data value
+        if (_kv_count == -1) {

Review Comment:
   fix comment



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1221,6 +1222,10 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
 
         it->Next();
     }
+    if (request.only_return_count) {
+        resp.kvs.emplace_back(::dsn::apps::key_value());

Review Comment:
   I need not to refactor `pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()` if I fill it.



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1198,7 +1198,8 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
             epoch_now,
             request.no_value,
             request.__isset.validate_partition_hash ? request.validate_partition_hash : true,
-            return_expire_ts);
+            return_expire_ts,
+            !request.only_return_count);

Review Comment:
   Every kv need to be processd by a filter from `append_key_value_for_scan`.



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1389,6 +1401,11 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
             it->Next();
         }
 
+        if (!only_return_count) {

Review Comment:
   `only_return_count ` is true means response `kvs` is empty, false means response `kvs` will be filled by data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#issuecomment-1212817124

   Could you give some performance comparation with the old version ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r938599524


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -2316,24 +2332,28 @@ pegasus_server_impl::append_key_value_for_scan(std::vector<::dsn::apps::key_valu
             return range_iteration_state::kFiltered;
         }
     }
-    std::shared_ptr<char> key_buf(::dsn::utils::make_shared_array<char>(raw_key.length()));
-    ::memcpy(key_buf.get(), raw_key.data(), raw_key.length());
-    kv.key.assign(std::move(key_buf), 0, raw_key.length());
+    if (fill_value) {

Review Comment:
   `return range_iteration_state::kNormal` All branches use it, So I think is unnecessary



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r942236139


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -2274,7 +2294,8 @@ pegasus_server_impl::append_key_value_for_scan(std::vector<::dsn::apps::key_valu
                                                uint32_t epoch_now,
                                                bool no_value,
                                                bool request_validate_hash,
-                                               bool request_expire_ts)
+                                               bool request_expire_ts,
+                                               bool fill_value)

Review Comment:
   The function name `append_key_value_for_scan` is not suitable now, because the 'append' operation will not happen when fill_value is false.
   You can do a refactor to separete then key-value valicate part and append part into two functions.



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1198,7 +1198,8 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
             epoch_now,
             request.no_value,
             request.__isset.validate_partition_hash ? request.validate_partition_hash : true,
-            return_expire_ts);
+            return_expire_ts,
+            !request.only_return_count);

Review Comment:
   I found you reverse the value every calls of append_key_value_for_scan, why not use it directly?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -2316,24 +2332,28 @@ pegasus_server_impl::append_key_value_for_scan(std::vector<::dsn::apps::key_valu
             return range_iteration_state::kFiltered;
         }
     }
-    std::shared_ptr<char> key_buf(::dsn::utils::make_shared_array<char>(raw_key.length()));
-    ::memcpy(key_buf.get(), raw_key.data(), raw_key.length());
-    kv.key.assign(std::move(key_buf), 0, raw_key.length());
+    if (fill_value) {

Review Comment:
   Sorry, I didn't get your point..
   I think short-circut return is often a recommand code style too.



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1389,6 +1401,11 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
             it->Next();
         }
 
+        if (!only_return_count) {
+            resp.kvs.emplace_back(::dsn::apps::key_value());

Review Comment:
   Same



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1389,6 +1401,11 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
             it->Next();
         }
 
+        if (!only_return_count) {

Review Comment:
   should be `if (only_return_count)` ?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1221,6 +1222,10 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
 
         it->Next();
     }
+    if (request.only_return_count) {
+        resp.kvs.emplace_back(::dsn::apps::key_value());

Review Comment:
   Why fill such a empty kv-pair, is it neccessary?



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1297,7 +1303,10 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
         _pfc_recent_filter_count->add(filter_count);
     }
 
-    _cu_calculator->add_scan_cu(req, resp.error, resp.kvs);
+    // abandon calculate capacity unit
+    if (!request.only_return_count) {

Review Comment:
   We should also calculate CU, and you have to update add_scan_cu, add some small CU at least.



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -162,13 +168,16 @@ void pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()
         }
 
         // valid data got
-        std::string hash_key, sort_key;
-        pegasus_restore_key(_kvs[_p].key, hash_key, sort_key);
-        std::string value(_kvs[_p].value.data(), _kvs[_p].value.length());
-        uint32_t expire_ts_seconds = _kvs[_p].__isset.expire_ts_seconds
-                                         ? static_cast<uint32_t>(_kvs[_p].expire_ts_seconds)
-                                         : 0;
-
+        std::string hash_key, sort_key, value;
+        uint32_t expire_ts_seconds = 0;
+        // _kv_count == -1 means req just want to get data counts, not include data value
+        if (_kv_count == -1) {

Review Comment:
   should be _kv_count != -1 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r941957204


##########
src/shell/command_helper.h:
##########
@@ -499,36 +501,40 @@ inline void scan_data_next(scan_data_context *context)
                             context->timeout_ms);
                         break;
                     case SCAN_COUNT:
-                        context->split_rows++;
-                        if (context->stat_size && context->statistics) {
-                            long hash_key_size = hash_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::HASH_KEY_SIZE),
-                                hash_key_size);
-
-                            long sort_key_size = sort_key.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::SORT_KEY_SIZE),
-                                sort_key_size);
-
-                            long value_size = value.size();
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::VALUE_SIZE), value_size);
-
-                            long row_size = hash_key_size + sort_key_size + value_size;
-                            context->statistics->measureTime(
-                                static_cast<uint32_t>(histogram_type::ROW_SIZE), row_size);
-
-                            if (context->top_count > 0) {
-                                context->top_rows.push(
-                                    std::move(hash_key), std::move(sort_key), row_size);
+                        if (kv_count == -1) {

Review Comment:
   In order to achieve the match between different versions, I think this is the best.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r945112243


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1187,21 +1187,23 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
 
         limiter->add_count();
 
-        auto state = append_key_value_for_scan(
-            resp.kvs,
+        auto state = validate_key_value_for_scan(
             it->key(),
             it->value(),
             request.hash_key_filter_type,
             request.hash_key_filter_pattern,
             request.sort_key_filter_type,
             request.sort_key_filter_pattern,
             epoch_now,
-            request.no_value,
-            request.__isset.validate_partition_hash ? request.validate_partition_hash : true,
-            return_expire_ts);
+            request.__isset.validate_partition_hash ? request.validate_partition_hash : true);
+
         switch (state) {
         case range_iteration_state::kNormal:
             count++;
+            if (!request.only_return_count) {

Review Comment:
   Should judge wether request.__isset.only_return_count is set at first, client
   maybe old version, it have to keep compatitable.



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -162,13 +168,16 @@ void pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()
         }
 
         // valid data got
-        std::string hash_key, sort_key;
-        pegasus_restore_key(_kvs[_p].key, hash_key, sort_key);
-        std::string value(_kvs[_p].value.data(), _kvs[_p].value.length());
-        uint32_t expire_ts_seconds = _kvs[_p].__isset.expire_ts_seconds
-                                         ? static_cast<uint32_t>(_kvs[_p].expire_ts_seconds)
-                                         : 0;
-
+        std::string hash_key, sort_key, value;
+        uint32_t expire_ts_seconds = 0;
+        // _kv_count == -1 means req just want to get data counts, not include data value
+        if (_kv_count == -1) {

Review Comment:
   How about add another variable to indicate directly what you want to do?



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -162,13 +168,16 @@ void pegasus_client_impl::pegasus_scanner_impl::_async_next_internal()
         }
 
         // valid data got
-        std::string hash_key, sort_key;
-        pegasus_restore_key(_kvs[_p].key, hash_key, sort_key);
-        std::string value(_kvs[_p].value.data(), _kvs[_p].value.length());
-        uint32_t expire_ts_seconds = _kvs[_p].__isset.expire_ts_seconds
-                                         ? static_cast<uint32_t>(_kvs[_p].expire_ts_seconds)
-                                         : 0;
-
+        std::string hash_key, sort_key, value;
+        uint32_t expire_ts_seconds = 0;
+        // _kv_count == -1 means req just want to get data counts, not include data value
+        if (_kv_count == -1) {

Review Comment:
   You should explain the case of `_kv_count == -1` and all other cases, `_kv_count > -1` is only part of case you have to explain.



##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1221,6 +1222,10 @@ void pegasus_server_impl::on_get_scanner(get_scanner_rpc rpc)
 
         it->Next();
     }
+    if (request.only_return_count) {
+        resp.kvs.emplace_back(::dsn::apps::key_value());

Review Comment:
   It's too tricky...
   How much work if refactor it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r938593714


##########
src/shell/commands/data_operations.cpp:
##########
@@ -2397,6 +2401,15 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         options.no_value = false;
     else
         options.no_value = true;
+
+    if (diff_hash_key || stat_size || value_filter_type != pegasus::pegasus_client::FT_NO_FILTER ||

Review Comment:
   I will make another one pull request laster.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] acelyc111 commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
acelyc111 commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r948637764


##########
src/shell/commands/data_operations.cpp:
##########
@@ -2333,6 +2333,7 @@ bool count_data(command_executor *e, shell_context *sc, arguments args)
         tp.output(std::cout, tp_output_format::kTabular);
         return true;
     }
+    options.only_return_count = true;

Review Comment:
   why set it here? only need to set it when `-c` flag is set, right?



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -261,6 +277,17 @@ void pegasus_client_impl::pegasus_scanner_impl::_on_scan_response(::dsn::error_c
             _kvs = std::move(response.kvs);
             _p = -1;
             _context = response.context_id;
+            // 1. kv_count exist on response means server is newer version (added counting size only
+            // implementation)
+            //   1> kv_count == -1 indicates response still have key and value data
+            //   2> kv_count > -1 indicates response only have kv size count, but not key && value
+            // 2. kv_count is not existed means server is older version
+            if (response.__isset.kv_count) {
+                if (response.kv_count != -1) {
+                    _already_add_count = false;

Review Comment:
   I don't understand what does `_already_add_count` mean, why set it to false here



##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -51,17 +51,20 @@ pegasus_client_impl::pegasus_scanner_impl::pegasus_scanner_impl(::dsn::apps::rrd
       _options(options),
       _splits_hash(std::move(hash)),
       _p(-1),
+      _kv_count(-1),
       _context(SCAN_CONTEXT_ID_COMPLETED),
       _rpc_started(false),
       _validate_partition_hash(validate_partition_hash),
-      _full_scan(full_scan)
+      _full_scan(full_scan),
+      _already_add_count(true)

Review Comment:
   why the default value is true?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] GehaFearless commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
GehaFearless commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r949842378


##########
src/client_lib/pegasus_scanner_impl.cpp:
##########
@@ -261,6 +278,17 @@ void pegasus_client_impl::pegasus_scanner_impl::_on_scan_response(::dsn::error_c
             _kvs = std::move(response.kvs);
             _p = -1;
             _context = response.context_id;
+            // 1. kv_count exist on response means server is newer version (added counting size only
+            // implementation)
+            //   1> kv_count == -1 indicates response still have key and value data
+            //   2> kv_count > -1 indicates response only have kv size count, but not key && value
+            // 2. kv_count is not existed means server is older version
+            if (response.__isset.kv_count) {
+                if (response.kv_count != -1) {

Review Comment:
   case value_filter be used on higher sever



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] WHBANG commented on a diff in pull request #1091: feat: improve performance of count_data

Posted by GitBox <gi...@apache.org>.
WHBANG commented on code in PR #1091:
URL: https://github.com/apache/incubator-pegasus/pull/1091#discussion_r942097974


##########
src/server/pegasus_server_impl.cpp:
##########
@@ -1449,7 +1466,10 @@ void pegasus_server_impl::on_scan(scan_rpc rpc)
         resp.error = rocksdb::Status::Code::kNotFound;
     }
 
-    _cu_calculator->add_scan_cu(req, resp.error, resp.kvs);
+    // abandon calculate capacity unit
+    if (only_return_count) {

Review Comment:
   if (!only_return_count) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org