You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by "AlexanderSaydakov (via GitHub)" <gi...@apache.org> on 2024/02/06 01:55:46 UTC

[PR] Tdigest [datasketches-cpp]

AlexanderSaydakov opened a new pull request, #422:
URL: https://github.com/apache/datasketches-cpp/pull/422

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Tdigest [datasketches-cpp]

Posted by "coveralls (via GitHub)" <gi...@apache.org>.
coveralls commented on PR #422:
URL: https://github.com/apache/datasketches-cpp/pull/422#issuecomment-1928646579

   ## Pull Request Test Coverage Report for [Build 7793567234](https://coveralls.io/builds/65518161)
   
   
   * **-16** of **93**   **(82.8%)**  changed or added relevant lines in **2** files are covered.
   * **2** unchanged lines in **1** file lost coverage.
   * Overall coverage decreased (**-0.09%**) to **98.699%**
   
   ---
   
   |  Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
   | :-----|--------------|--------|---: |
   | [tdigest/include/tdigest_impl.hpp](https://coveralls.io/builds/65518161/source?filename=tdigest%2Finclude%2Ftdigest_impl.hpp#L435) | 68 | 84 | 80.95%
   <!-- | **Total:** | **77** | **93** | **82.8%** | -->
   
   |  Files with Coverage Reduction | New Missed Lines | % |
   | :-----|--------------|--: |
   | [tdigest/include/tdigest_impl.hpp](https://coveralls.io/builds/65518161/source?filename=tdigest%2Finclude%2Ftdigest_impl.hpp#L356) | 2 | 88.27% |
   <!-- | **Total:** | **2** |  | -->
   
   |  Totals | [![Coverage Status](https://coveralls.io/builds/65518161/badge)](https://coveralls.io/builds/65518161) |
   | :-- | --: |
   | Change from base [Build 7715993625](https://coveralls.io/builds/65391363): |  -0.09% |
   | Covered Lines: | 16383 |
   | Relevant Lines: | 16599 |
   
   ---
   ##### 💛  - [Coveralls](https://coveralls.io)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Tdigest [datasketches-cpp]

Posted by "AlexanderSaydakov (via GitHub)" <gi...@apache.org>.
AlexanderSaydakov merged PR #422:
URL: https://github.com/apache/datasketches-cpp/pull/422


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


Re: [PR] Tdigest [datasketches-cpp]

Posted by "jmalkin (via GitHub)" <gi...@apache.org>.
jmalkin commented on code in PR #422:
URL: https://github.com/apache/datasketches-cpp/pull/422#discussion_r1485670485


##########
tdigest/include/tdigest_impl.hpp:
##########
@@ -426,6 +428,111 @@ tdigest<T, A> tdigest<T, A>::deserialize(const void* bytes, size_t size, const A
   return tdigest(reverse_merge, k, min, max, std::move(centroids), total_weight, allocator);
 }
 
+template<typename T, typename A>
+tdigest<T, A> tdigest<T, A>::deserialize_compat(std::istream& is, const A& allocator) {
+  const auto type = read<uint8_t>(is);
+  if (type != COMPAT_DOUBLE && type != COMPAT_FLOAT) {
+    throw std::invalid_argument("unexpected sketch preamble: 0 0 0 " + std::to_string(type));

Review Comment:
   Since we only read one byte here this was a little confusing. I realized it's that we already read 3 bytes before calling this method. I think it'd be good to have a comment explicitly noting that somewhere.



##########
tdigest/include/tdigest_impl.hpp:
##########
@@ -426,6 +428,111 @@ tdigest<T, A> tdigest<T, A>::deserialize(const void* bytes, size_t size, const A
   return tdigest(reverse_merge, k, min, max, std::move(centroids), total_weight, allocator);
 }
 
+template<typename T, typename A>
+tdigest<T, A> tdigest<T, A>::deserialize_compat(std::istream& is, const A& allocator) {
+  const auto type = read<uint8_t>(is);
+  if (type != COMPAT_DOUBLE && type != COMPAT_FLOAT) {
+    throw std::invalid_argument("unexpected sketch preamble: 0 0 0 " + std::to_string(type));
+  }
+  if (type == COMPAT_DOUBLE) {
+    const auto min = read_big_endian<double>(is);
+    const auto max = read_big_endian<double>(is);
+    const auto k = static_cast<uint16_t>(read_big_endian<double>(is));
+    const auto num_centroids = read_big_endian<uint32_t>(is);
+    vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+    uint64_t total_weight = 0;
+    for (auto& c: centroids) {
+      const uint64_t weight = static_cast<uint64_t>(read_big_endian<double>(is));
+      const auto mean = read_big_endian<double>(is);
+      c = centroid(mean, weight);
+      total_weight += weight;
+    }
+    return tdigest(false, k, min, max, std::move(centroids), total_weight, allocator);
+  }
+  // compatibility with asSmallBytes()
+  const auto min = read_big_endian<double>(is); // reference implementation uses doubles for min and max
+  const auto max = read_big_endian<double>(is);
+  const auto k = static_cast<uint16_t>(read_big_endian<float>(is));
+  read<uint32_t>(is); // unused
+  const auto num_centroids = read_big_endian<uint16_t>(is);
+  vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+  uint64_t total_weight = 0;
+  for (auto& c: centroids) {
+    const uint64_t weight = static_cast<uint64_t>(read_big_endian<float>(is));
+    const auto mean = read_big_endian<float>(is);
+    c = centroid(mean, weight);
+    total_weight += weight;
+  }
+  return tdigest(false, k, min, max, std::move(centroids), total_weight, allocator);
+}
+
+template<typename T, typename A>
+tdigest<T, A> tdigest<T, A>::deserialize_compat(const void* bytes, size_t size, const A& allocator) {
+  const char* ptr = static_cast<const char*>(bytes);
+  const auto type = *ptr++;
+  if (type != COMPAT_DOUBLE && type != COMPAT_FLOAT) {
+    throw std::invalid_argument("unexpected sketch preamble: 0 0 0 " + std::to_string(type));
+  }
+  const char* end_ptr = static_cast<const char*>(bytes) + size;
+  if (type == COMPAT_DOUBLE) {
+    ensure_minimum_memory(end_ptr - ptr, sizeof(double) * 3 + sizeof(uint32_t));
+    double min;
+    ptr += copy_from_mem(ptr, min);
+    min = byteswap(min);
+    double max;
+    ptr += copy_from_mem(ptr, max);
+    max = byteswap(max);
+    double k_double;
+    ptr += copy_from_mem(ptr, k_double);
+    const uint16_t k = static_cast<uint16_t>(byteswap(k_double));
+    uint32_t num_centroids;
+    ptr += copy_from_mem(ptr, num_centroids);
+    num_centroids = byteswap(num_centroids);
+    ensure_minimum_memory(end_ptr - ptr, sizeof(double) * num_centroids * 2);
+    vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+    uint64_t total_weight = 0;
+    for (auto& c: centroids) {
+      double weight;
+      ptr += copy_from_mem(ptr, weight);
+      weight = byteswap(weight);
+      double mean;
+      ptr += copy_from_mem(ptr, mean);
+      mean = byteswap(mean);
+      c = centroid(mean, static_cast<uint64_t>(weight));
+      total_weight += static_cast<uint64_t>(weight);

Review Comment:
   For stream deserialization weight was cast to uint64_t first which avoided casing the same value to the same type twice.



##########
tdigest/include/tdigest_impl.hpp:
##########
@@ -426,6 +428,111 @@ tdigest<T, A> tdigest<T, A>::deserialize(const void* bytes, size_t size, const A
   return tdigest(reverse_merge, k, min, max, std::move(centroids), total_weight, allocator);
 }
 
+template<typename T, typename A>
+tdigest<T, A> tdigest<T, A>::deserialize_compat(std::istream& is, const A& allocator) {
+  const auto type = read<uint8_t>(is);
+  if (type != COMPAT_DOUBLE && type != COMPAT_FLOAT) {
+    throw std::invalid_argument("unexpected sketch preamble: 0 0 0 " + std::to_string(type));
+  }
+  if (type == COMPAT_DOUBLE) {
+    const auto min = read_big_endian<double>(is);
+    const auto max = read_big_endian<double>(is);
+    const auto k = static_cast<uint16_t>(read_big_endian<double>(is));
+    const auto num_centroids = read_big_endian<uint32_t>(is);
+    vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+    uint64_t total_weight = 0;
+    for (auto& c: centroids) {
+      const uint64_t weight = static_cast<uint64_t>(read_big_endian<double>(is));
+      const auto mean = read_big_endian<double>(is);
+      c = centroid(mean, weight);
+      total_weight += weight;
+    }
+    return tdigest(false, k, min, max, std::move(centroids), total_weight, allocator);
+  }
+  // compatibility with asSmallBytes()
+  const auto min = read_big_endian<double>(is); // reference implementation uses doubles for min and max
+  const auto max = read_big_endian<double>(is);
+  const auto k = static_cast<uint16_t>(read_big_endian<float>(is));
+  read<uint32_t>(is); // unused
+  const auto num_centroids = read_big_endian<uint16_t>(is);

Review Comment:
   This one line seems to be what prevents having an easily templatized common deserialize. I guess we'd need to condition on the size of T and use a type that's half that size, which gets a bit messier. A pity.



##########
tdigest/include/tdigest_impl.hpp:
##########
@@ -426,6 +428,111 @@ tdigest<T, A> tdigest<T, A>::deserialize(const void* bytes, size_t size, const A
   return tdigest(reverse_merge, k, min, max, std::move(centroids), total_weight, allocator);
 }
 
+template<typename T, typename A>
+tdigest<T, A> tdigest<T, A>::deserialize_compat(std::istream& is, const A& allocator) {
+  const auto type = read<uint8_t>(is);
+  if (type != COMPAT_DOUBLE && type != COMPAT_FLOAT) {
+    throw std::invalid_argument("unexpected sketch preamble: 0 0 0 " + std::to_string(type));
+  }
+  if (type == COMPAT_DOUBLE) {
+    const auto min = read_big_endian<double>(is);
+    const auto max = read_big_endian<double>(is);
+    const auto k = static_cast<uint16_t>(read_big_endian<double>(is));
+    const auto num_centroids = read_big_endian<uint32_t>(is);
+    vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+    uint64_t total_weight = 0;
+    for (auto& c: centroids) {
+      const uint64_t weight = static_cast<uint64_t>(read_big_endian<double>(is));
+      const auto mean = read_big_endian<double>(is);
+      c = centroid(mean, weight);
+      total_weight += weight;
+    }
+    return tdigest(false, k, min, max, std::move(centroids), total_weight, allocator);
+  }
+  // compatibility with asSmallBytes()
+  const auto min = read_big_endian<double>(is); // reference implementation uses doubles for min and max
+  const auto max = read_big_endian<double>(is);
+  const auto k = static_cast<uint16_t>(read_big_endian<float>(is));
+  read<uint32_t>(is); // unused
+  const auto num_centroids = read_big_endian<uint16_t>(is);
+  vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+  uint64_t total_weight = 0;
+  for (auto& c: centroids) {
+    const uint64_t weight = static_cast<uint64_t>(read_big_endian<float>(is));
+    const auto mean = read_big_endian<float>(is);
+    c = centroid(mean, weight);
+    total_weight += weight;
+  }
+  return tdigest(false, k, min, max, std::move(centroids), total_weight, allocator);
+}
+
+template<typename T, typename A>
+tdigest<T, A> tdigest<T, A>::deserialize_compat(const void* bytes, size_t size, const A& allocator) {
+  const char* ptr = static_cast<const char*>(bytes);
+  const auto type = *ptr++;
+  if (type != COMPAT_DOUBLE && type != COMPAT_FLOAT) {
+    throw std::invalid_argument("unexpected sketch preamble: 0 0 0 " + std::to_string(type));
+  }
+  const char* end_ptr = static_cast<const char*>(bytes) + size;
+  if (type == COMPAT_DOUBLE) {
+    ensure_minimum_memory(end_ptr - ptr, sizeof(double) * 3 + sizeof(uint32_t));
+    double min;
+    ptr += copy_from_mem(ptr, min);
+    min = byteswap(min);
+    double max;
+    ptr += copy_from_mem(ptr, max);
+    max = byteswap(max);
+    double k_double;
+    ptr += copy_from_mem(ptr, k_double);
+    const uint16_t k = static_cast<uint16_t>(byteswap(k_double));
+    uint32_t num_centroids;
+    ptr += copy_from_mem(ptr, num_centroids);
+    num_centroids = byteswap(num_centroids);
+    ensure_minimum_memory(end_ptr - ptr, sizeof(double) * num_centroids * 2);
+    vector_centroid centroids(num_centroids, centroid(0, 0), allocator);
+    uint64_t total_weight = 0;
+    for (auto& c: centroids) {
+      double weight;
+      ptr += copy_from_mem(ptr, weight);
+      weight = byteswap(weight);
+      double mean;
+      ptr += copy_from_mem(ptr, mean);
+      mean = byteswap(mean);
+      c = centroid(mean, static_cast<uint64_t>(weight));
+      total_weight += static_cast<uint64_t>(weight);
+    }
+    return tdigest(false, k, min, max, std::move(centroids), total_weight, allocator);
+  }
+  ensure_minimum_memory(end_ptr - ptr, sizeof(double) * 2 + sizeof(float) + sizeof(uint16_t) * 3);

Review Comment:
   Please add a comment about compatibility with asSmallBytes() or noting that this is COMPAT_FLOAT mode



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org