You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/09/21 16:25:06 UTC

[GitHub] [orc] guiyanakuang opened a new pull request #915: ORC-98: Add support for t-digests to ORC

guiyanakuang opened a new pull request #915:
URL: https://github.com/apache/orc/pull/915

### What changes were proposed in this pull request?

CollectionColumnStatisticsImpl / IntegerStatisticsImpl / DoubleStatisticsImpl / DateStatisticsImpl supports the recording of statistical information via TDigest.

### Why are the changes needed?

1. Easy access to histograms/quartiles. Get the distribution of the data.
2. As supporting information for your query. For example it can assist in filtering conditions and finding faster paths for descent pruning.

### How was this patch tested?

Adding relevant unit tests.

### Current Status
1. More unit tests are needed.
2. Lack of comments.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713696199



##########
File path: proto/orc_proto.proto
##########
@@ -26,12 +26,14 @@ message IntegerStatistics  {
   optional sint64 minimum = 1;
   optional sint64 maximum = 2;
   optional sint64 sum = 3;
+  optional bytes digest = 4;
 }
 
 message DoubleStatistics {
   optional double minimum = 1;
   optional double maximum = 2;
   optional double sum = 3;
+  optional bytes digest = 4;

Review comment:
       Yes you are right, after researching DataSketches I will come back with my proposal.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713676144



##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       @wgtmac  I first learned about TDigest from this [issue](https://issues.apache.org/jira/projects/ORC/issues/ORC-98?filter=allopenissues), which inspired me to. So I don't actually have a preference, I see that DataSketches has implementations in different languages, and if it is well compatible with reading and writing serialisation in different languages, then I am in favour of using DataSketches. I need some time to learn it. :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713408854



##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       This reminds me the following Spark PR. In the community, it's a little hard to add this kind of dependency without a significant justification.
   - https://github.com/apache/spark/pull/26197/files#diff-283ead138b37209525130359830c4c6eeafaf5c6639b0df82b770389d24002bfR134




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713410549



##########
File path: java/core/src/java/org/apache/orc/PhysicalWriter.java
##########
@@ -77,6 +77,8 @@ void writeIndex(StreamName name,
   void writeBloomFilter(StreamName name,
                         OrcProto.BloomFilterIndex.Builder bloom) throws IOException;
 
+//  void writeTDigest(StreamName name, OrcProto) throws IOException;
+

Review comment:
       Shall we remove this comments?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713542266



##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       > Do we have some benchmark result for this enhancement?
   > 
   > > For example it can assist in filtering conditions and finding faster paths for pruning.
   
   @dongjoon-hyun, Currently pr is still in draft form. I will add benchmarking, unit testing and documentation. It might be a bit slow because I need to finish the work at hand first. 😃 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713694228



##########
File path: java/core/src/test/org/apache/orc/impl/TestDigest.java
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to Ted Dunning under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.ColumnStatistics;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestDigest {
+
+  Path workDir = new Path(System.getProperty("test.tmp.dir"));
+  Configuration conf;
+  FileSystem fs;
+  Path testFilePath;
+  TypeDescription schema;
+
+  @BeforeEach
+  public void openFileSystem() throws Exception {
+    conf = new Configuration();
+    conf.set(OrcConf.DIGEST_COLUMNS.getAttribute(), "id:100");
+    fs = FileSystem.getLocal(conf);
+    fs.setWorkingDirectory(workDir);
+    testFilePath = new Path("testWriterImpl.orc");
+    fs.create(testFilePath, true);
+    schema = TypeDescription.fromString("struct<id:int,name:string>");
+  }
+
+  @AfterEach
+  public void deleteTestFile() throws Exception {
+    fs.delete(testFilePath, false);
+  }
+
+  private void write() throws IOException {
+    Writer writer = OrcFile.createWriter(testFilePath,
+        OrcFile.writerOptions(conf)
+            .overwrite(true)
+            .setSchema(schema)
+    );
+    VectorizedRowBatch batch = schema.createRowBatch();
+    LongColumnVector id = (LongColumnVector) batch.cols[0];
+    BytesColumnVector name = (BytesColumnVector) batch.cols[1];
+    for (int r = 1; r < 20; ++r) {
+      int row = batch.size++;
+      id.vector[row] = r;
+      byte[] buffer = String.valueOf(r).getBytes(StandardCharsets.UTF_8);
+      name.setRef(row, buffer, 0, buffer.length);
+    }
+    int row = batch.size++;
+    id.vector[row] = 1_000_000;
+    byte[] buffer = String.valueOf(1_000_000).getBytes(StandardCharsets.UTF_8);
+    name.setRef(row, buffer, 0, buffer.length);
+    writer.addRowBatch(batch);
+    writer.close();
+  }
+
+  private void read() throws IOException {

Review comment:
       Current statistics already support encryption.
   https://github.com/apache/orc/blob/cf720d7b2333618c652445299486cb9600bdbe4f/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L355-L382
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-928756377


   cc @omalley since this is ORC-98.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wgtmac commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

wgtmac commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-927607518


   > In some environments the test has an overflow check that is not as expected. I haven't figured out why, but I believe this overflow detection code is incorrect.
   > https://github.com/apache/orc/blob/6da96bb8ceb64528d082974efed411c4c29f3408/c%2B%2B/src/Statistics.cc#L180-L190
   > 
   > 
   > A counter-example can easily be given
   > Assume `sum=1`, `update(std::numeric_limits<int64_t>::max(), 3);`
   > `value * repetitions + _stats.getSum()` is overflowed, but is still a positive number : 9223372036854775806
   > @dongjoon-hyun @wgtmac What do you think? :)
   
   This piece of code was simply copied from the java side:
   https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L362-L379
   
   We need make sure value * repetitions does not overflow before checking the sum. : (


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713679762



##########
File path: java/core/src/java/org/apache/orc/DigestConf.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc;
+
+import com.tdunning.math.stats.TDigest;
+
+public class DigestConf {
+
+  public static final DigestConf NO_CREATE_DIGEST = new DigestConf(false, 0);
+
+  public static final DigestConf DEFAULT_DIGEST = new DigestConf(true, 1000);
+
+  private final boolean createDigest;
+
+  private final double digestCompression;

Review comment:
       Parameter for Digest initialisation, which controls the precision of the final returned histogram, quantile. The larger it is the less data is stored internally and the lower the precision, the smaller it is when serialised. So I have designed it so that the user can control the precision of the different fields.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-936629718


   I have completed a benchmark test (dba9a1a) using the current implementation for the time being. To show the benefits of custom statistics.
   Added OptimizeFilterBenchmark. Tested the performance of the default query, and of the filter condition with the base percentage of filter values re-ordered by TDigest.
   
   proportion: Ratio of cardinal number between columns
   quota: Minimum cardinal number
   
   ```
   Benchmark                            (proportion)  (quota)  Mode  Cnt     Score     Error  Units
   OptimizeFilterBenchmark.noUseTDigest             2       10  avgt   20  1052.305 ±  10.632  us/op
   OptimizeFilterBenchmark.noUseTDigest             2      100  avgt   20  1109.375 ±  10.162  us/op
   OptimizeFilterBenchmark.noUseTDigest             2     1000  avgt   20  1173.790 ±  11.696  us/op
   OptimizeFilterBenchmark.noUseTDigest             3       10  avgt   20  1056.139 ±   8.359  us/op
   OptimizeFilterBenchmark.noUseTDigest             3      100  avgt   20  1154.665 ±   9.152  us/op
   OptimizeFilterBenchmark.noUseTDigest             3     1000  avgt   20  1168.113 ±   9.115  us/op
   OptimizeFilterBenchmark.useTDigest               2       10  avgt   20  1116.076 ±   6.330  us/op
   OptimizeFilterBenchmark.useTDigest               2      100  avgt   20  1162.956 ±   9.865  us/op
   OptimizeFilterBenchmark.useTDigest               2     1000  avgt   20  1220.028 ±  22.544  us/op
   OptimizeFilterBenchmark.useTDigest               3       10  avgt   20  1114.617 ±  10.220  us/op
   OptimizeFilterBenchmark.useTDigest               3      100  avgt   20  1219.488 ± 138.798  us/op
   OptimizeFilterBenchmark.useTDigest               3     1000  avgt   20   651.001 ±  20.784  us/op
   ```
   
   Tests show some performance loss in reading custom statistics structures, around 100 us/op. When there are sparse values in the filter conditions, reordering the filter conditions results in a larger performance gain, as this allows for early pruning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-948207235


   I'm posting this comment to show that I haven't forgotten about this pr and that I still need some time to refactor the code to make it meet expectations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-928756377


   cc @omalley since this is ORC-98.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wgtmac commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

wgtmac commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713630200



##########
File path: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
##########
@@ -423,6 +502,12 @@ public void merge(ColumnStatisticsImpl other) {
       if (!overflow) {
         intb.setSum(sum);
       }
+      if (tDigest != null && digestConf.getPersistence()) {
+        ByteBuffer byteBuffer = ByteBuffer.allocate(tDigest.byteSize());
+        tDigest.asBytes(byteBuffer);
+        byteBuffer.flip();

Review comment:
       Why do we need a flip here?

##########
File path: java/core/src/java/org/apache/orc/DigestConf.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc;
+
+import com.tdunning.math.stats.TDigest;
+
+public class DigestConf {
+
+  public static final DigestConf NO_CREATE_DIGEST = new DigestConf(false, 0);
+
+  public static final DigestConf DEFAULT_DIGEST = new DigestConf(true, 1000);
+
+  private final boolean createDigest;
+
+  private final double digestCompression;

Review comment:
       What does it mean? Does t-digest supports compression natively (using standard compression algorithm) or simple serialization?

##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       I agree with @dongjoon-hyun that introducing a new dependency adds more complexity to this project. It looks like t-digest has been widely adopted by many projects and big name companies: https://www.sciencedirect.com/science/article/pii/S2665963820300403. It would be good if we can make it as a plugin instead of a direct dependency.

##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       @guiyanakuang I am also curious of your preference of t-digest over DataSketches which is also an Apache project shipped with similar feature.

##########
File path: proto/orc_proto.proto
##########
@@ -26,12 +26,14 @@ message IntegerStatistics  {
   optional sint64 minimum = 1;
   optional sint64 maximum = 2;
   optional sint64 sum = 3;
+  optional bytes digest = 4;
 }
 
 message DoubleStatistics {
   optional double minimum = 1;
   optional double maximum = 2;
   optional double sum = 3;
+  optional bytes digest = 4;

Review comment:
       Does t-digest has breaking compatibility of serialization among different versions? Additionally, dealing with serialized t-digests between Java and C++ is also something we should be aware of.

##########
File path: java/core/src/test/org/apache/orc/impl/TestDigest.java
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to Ted Dunning under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.ColumnStatistics;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestDigest {
+
+  Path workDir = new Path(System.getProperty("test.tmp.dir"));
+  Configuration conf;
+  FileSystem fs;
+  Path testFilePath;
+  TypeDescription schema;
+
+  @BeforeEach
+  public void openFileSystem() throws Exception {
+    conf = new Configuration();
+    conf.set(OrcConf.DIGEST_COLUMNS.getAttribute(), "id:100");
+    fs = FileSystem.getLocal(conf);
+    fs.setWorkingDirectory(workDir);
+    testFilePath = new Path("testWriterImpl.orc");
+    fs.create(testFilePath, true);
+    schema = TypeDescription.fromString("struct<id:int,name:string>");
+  }
+
+  @AfterEach
+  public void deleteTestFile() throws Exception {
+    fs.delete(testFilePath, false);
+  }
+
+  private void write() throws IOException {
+    Writer writer = OrcFile.createWriter(testFilePath,
+        OrcFile.writerOptions(conf)
+            .overwrite(true)
+            .setSchema(schema)
+    );
+    VectorizedRowBatch batch = schema.createRowBatch();
+    LongColumnVector id = (LongColumnVector) batch.cols[0];
+    BytesColumnVector name = (BytesColumnVector) batch.cols[1];
+    for (int r = 1; r < 20; ++r) {
+      int row = batch.size++;
+      id.vector[row] = r;
+      byte[] buffer = String.valueOf(r).getBytes(StandardCharsets.UTF_8);
+      name.setRef(row, buffer, 0, buffer.length);
+    }
+    int row = batch.size++;
+    id.vector[row] = 1_000_000;
+    byte[] buffer = String.valueOf(1_000_000).getBytes(StandardCharsets.UTF_8);
+    name.setRef(row, buffer, 0, buffer.length);
+    writer.addRowBatch(batch);
+    writer.close();
+  }
+
+  private void read() throws IOException {

Review comment:
       Should we encrypt the digest if column encryption is enabled to avoid data leak?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713539981



##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       > This reminds me the following Spark PR. In the community, it's a little hard to add this kind of dependency without a significant justification.
   > 
   > * https://github.com/apache/spark/pull/26197/files#diff-283ead138b37209525130359830c4c6eeafaf5c6639b0df82b770389d24002bfR134
   
   I understand the point and the fear of not being maintained. But the actual algorithm is short and easy to understand.
   
   > https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java
   
   Can I copy this file into the ORC project instead of the dependency? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-927619421


   Agreed, I also think that using a string to mark the implementation feels unreliable. But the good thing is that the current implementation proves the feasibility. A new module that implements custom digest feels good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-927577261


   In some environments the test has an overflow check that is not as expected. I haven't figured out why, but I believe this overflow detection code is incorrect.
   https://github.com/apache/orc/blob/6da96bb8ceb64528d082974efed411c4c29f3408/c%2B%2B/src/Statistics.cc#L180-L190
   A counter-example can easily be given
   Assume `sum=1`, `update(std::numeric_limits<int64_t>::max(), 3);`
   `value * repetitions + _stats.getSum()` is overflowed, but is still a positive number : 9223372036854775806
   @dongjoon-hyun @wgtmac What do you think? :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] wgtmac commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

wgtmac commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-927612662


   > Hi @dongjoon-hyun @wgtmac. After some testing and some thought, I have decided to modify this pr in the following way, and we will discuss any disagreements.
   > 
   > ## Enhancing ColumnStatistics with a plugin approach
   > The data structures in either TDigest or datasketches can be specific implementations in the plugin. orc-core does not add any dependencies, test and benchmark modules add dependencies and specific implementations.
   > 
   > ```proto
   > message Digest {
   >   optional string digestName = 1;
   >   optional bytes digestContent = 2;
   > }
   > 
   > message DoubleStatistics {
   >   optional double minimum = 1;
   >   optional double maximum = 2;
   >   optional double sum = 3;
   >   optional Digest digest = 4;
   > }
   > ```
   > 
   > Both Java and C++ will use digestName to find specific plugin implementation. Failed to find degrades to a default empty implementation.
   > 
   > 1. Does digest has breaking compatibility of serialization among different versions ?
   >    Since Digest is defined as optional. Older versions will automatically ignore the Digest field when reading newer versions of files, I did some tests that looked good. This will also be added to the unit test.
   > 2. How do I deal with the serialisation of digest between Java and C++ ?
   >    As the enhancement is provided in the form of a plugin, if the user needs java to write C++ to read or otherwise. This requires a user implementation to ensure serialisation between languages. I thought we could add example based on datasketches (which has multiple language implementations).
   > 
   > Also, I think I'll add a command to the tool to see the field's digestName.
   
   So I suppose you want to use digestName as a hint to let readers know how to decode it? Should we write some useful information including type (digest, datasketches or sth else), version, binding (java, c++, etc.) so that the readers are able to decide if they can understand it? For the plugin approach on the java side, we may provide an individual module to help read/write digest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-925480666

After some testing and some thought, I have decided to modify this pr in the following way, and we will discuss any disagreements.

## Enhancing ColumnStatistics with a plugin approach
The data structures in either TDigest or datasketches can be specific implementations in the plugin. orc-core does not add any dependencies, test and benchmark modules add dependencies and specific implementations.

```proto
message Digest {
optional string digestName = 1;
optional bytes digestContent = 2;
}

message DoubleStatistics {
optional double minimum = 1;
optional double maximum = 2;
optional double sum = 3;
optional Digest digest = 4;
}
```
Both Java and C++ will use digestName to find specific plugin implementation. Failed to find degrades to a default empty implementation.

1. Does digest has breaking compatibility of serialization among different versions ?
Since Digest is defined as optional. Older versions will automatically ignore the Digest field when reading newer versions of files, I did some tests that looked good. This will also be added to the unit test.

2. How do I deal with the serialisation of digest between Java and C++ ?
As the enhancement is provided in the form of a plugin, if the user needs java to write C++ to read or otherwise. This requires a user implementation to ensure serialisation between languages. I thought we could add example based on datasketches (which has multiple language implementations).

Also, I think I'll add a command to the tool to see the field's digestName.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-927611970


   @wgtmac You're right, we need to check the value * repetitions first, I've created the issue, added the c++ and java component tags and I'll try to fix it later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713410344



##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       Do we have some benchmark result for this enhancement?
   > For example it can assist in filtering conditions and finding faster paths for pruning.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713684138



##########
File path: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
##########
@@ -423,6 +502,12 @@ public void merge(ColumnStatisticsImpl other) {
       if (!overflow) {
         intb.setSum(sum);
       }
+      if (tDigest != null && digestConf.getPersistence()) {
+        ByteBuffer byteBuffer = ByteBuffer.allocate(tDigest.byteSize());
+        tDigest.asBytes(byteBuffer);
+        byteBuffer.flip();

Review comment:
       A statement after ： `intb.setDigest(ByteString.copyFrom(byteBuffer));`
   
   ```
     public static ByteString copyFrom(ByteBuffer bytes) {
       return copyFrom(bytes, bytes.remaining());
     }
   ```
   ByteBuffer to ByteString, flipped to read the correct size.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713544747



##########
File path: java/core/src/java/org/apache/orc/PhysicalWriter.java
##########
@@ -77,6 +77,8 @@ void writeIndex(StreamName name,
   void writeBloomFilter(StreamName name,
                         OrcProto.BloomFilterIndex.Builder bloom) throws IOException;
 
+//  void writeTDigest(StreamName name, OrcProto) throws IOException;
+

Review comment:
       Fix in 9133711




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang commented on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang commented on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-924693919


   > @guiyanakuang This work is interesting. However, we should be careful if it involves format change.
   
   @wgtmac Couldn't agree with you more. I am also submitting this draft in the hope of having a full discussion. Thanks for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] guiyanakuang edited a comment on pull request #915: ORC-98: Add support for t-digests to ORC

Posted by GitBox <gi...@apache.org>.

guiyanakuang edited a comment on pull request #915:
URL: https://github.com/apache/orc/pull/915#issuecomment-925480666

Hi @dongjoon-hyun @wgtmac. After some testing and some thought, I have decided to modify this pr in the following way, and we will discuss any disagreements.

```proto
message Digest {
optional string digestName = 1;
optional bytes digestContent = 2;
}

Also, I think I'll add a command to the tool to see the field's digestName.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org