Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/09/23 03:27:40 UTC

[GitHub] [orc] guiyanakuang edited a comment on pull request #915: ORC-98: Add support for t-digests to ORC

guiyanakuang edited a comment on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-925480666


   Hi @dongjoon-hyun @wgtmac. After some testing and some thought, I have decided to modify this PR as follows. We can discuss any disagreements.
   
   ## Enhancing ColumnStatistics with a plugin approach
   The data structure, whether TDigest or datasketches, can be a specific implementation provided by a plugin. orc-core does not add any dependencies; only the test and benchmark modules add the dependencies and the specific implementations.
   
   ```proto
   message Digest {
     optional string digestName = 1;
     optional bytes digestContent = 2;
   }
   
   message DoubleStatistics {
     optional double minimum = 1;
     optional double maximum = 2;
     optional double sum = 3;
     optional Digest digest = 4;
   }
   ```
   Both Java and C++ will use `digestName` to look up the specific plugin implementation. If no matching plugin is found, the lookup falls back to a default empty implementation.
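   
   To make the lookup-with-fallback idea concrete, here is a minimal Java sketch. All names in it (`DigestPlugin`, `DigestFactory`, `DigestRegistry`, `EmptyDigest`) are hypothetical and are not part of the ORC API; it only illustrates how `digestName` could select a registered implementation and fall back to an empty one.
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical plugin interface; not part of the ORC API.
   interface DigestPlugin {
     String name();
     void update(double value);
     byte[] serialize();
   }

   // Hypothetical factory that rebuilds a digest from the stored digestContent bytes.
   interface DigestFactory {
     DigestPlugin fromBytes(byte[] content);
   }

   // Default fallback: records nothing and serializes to an empty payload.
   final class EmptyDigest implements DigestPlugin {
     static final EmptyDigest INSTANCE = new EmptyDigest();
     private EmptyDigest() { }
     @Override public String name() { return ""; }
     @Override public void update(double value) { }
     @Override public byte[] serialize() { return new byte[0]; }
   }

   final class DigestRegistry {
     private static final Map<String, DigestFactory> FACTORIES = new ConcurrentHashMap<>();
     private DigestRegistry() { }

     // Plugins register themselves under the name they write into digestName.
     static void register(String digestName, DigestFactory factory) {
       FACTORIES.put(digestName, factory);
     }

     // Resolve a plugin by name; an unknown or missing name yields the empty digest.
     static DigestPlugin load(String digestName, byte[] digestContent) {
       if (digestName == null) {
         return EmptyDigest.INSTANCE;
       }
       DigestFactory factory = FACTORIES.get(digestName);
       return factory == null ? EmptyDigest.INSTANCE : factory.fromBytes(digestContent);
     }
   }
   ```
   
   With this shape, a reader that lacks the plugin still gets a usable (empty) object instead of an error, which matches the degradation behaviour described above.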
   
   1. Does the digest break serialization compatibility between versions?
   Because `digest` is declared `optional`, older readers automatically ignore the field when reading files written by newer versions. I ran some tests and the results looked good. This case will also be added to the unit tests.
   
   2. How do we handle serialization of the digest between Java and C++?
   Because the enhancement is provided as a plugin, a user who wants to write with Java and read with C++ (or the reverse) must supply an implementation that serializes compatibly across languages. I think we could add an example based on datasketches, which has implementations in multiple languages.
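   
   As a rough illustration of what "serialization that works across languages" requires, the sketch below writes `digestContent` with an explicit version byte and a fixed byte order, so a C++ reader could parse the same bytes. The layout and the class name `PortableDigestCodec` are assumptions made up for this example; this is not the datasketches or t-digest wire format.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   // Hypothetical codec: [version: 1 byte][count: int32 LE][values: float64 LE ...]
   final class PortableDigestCodec {
     private static final byte FORMAT_VERSION = 1;
     private PortableDigestCodec() { }

     static byte[] encode(double[] values) {
       // Fix the byte order explicitly so the output does not depend on the platform.
       ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 8 * values.length)
           .order(ByteOrder.LITTLE_ENDIAN);
       buf.put(FORMAT_VERSION);
       buf.putInt(values.length);
       for (double v : values) {
         buf.putDouble(v);
       }
       return buf.array();
     }

     static double[] decode(byte[] content) {
       ByteBuffer buf = ByteBuffer.wrap(content).order(ByteOrder.LITTLE_ENDIAN);
       byte version = buf.get();
       if (version != FORMAT_VERSION) {
         throw new IllegalArgumentException("unsupported digest format version: " + version);
       }
       int count = buf.getInt();
       double[] values = new double[count];
       for (int i = 0; i < count; i++) {
         values[i] = buf.getDouble();
       }
       return values;
     }
   }
   ```
   
   The point is only that both sides must agree on version, field order, and endianness; a real plugin would more likely delegate to the library's own portable format.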
   
   Also, I think I'll add a command to the tool that shows a field's `digestName`.

