You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/09/22 06:47:50 UTC

[GitHub] [orc] wgtmac commented on a change in pull request #915: ORC-98: Add support for t-digests to ORC

wgtmac commented on a change in pull request #915:
URL: https://github.com/apache/orc/pull/915#discussion_r713630200



##########
File path: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
##########
@@ -423,6 +502,12 @@ public void merge(ColumnStatisticsImpl other) {
       if (!overflow) {
         intb.setSum(sum);
       }
+      if (tDigest != null && digestConf.getPersistence()) {
+        ByteBuffer byteBuffer = ByteBuffer.allocate(tDigest.byteSize());
+        tDigest.asBytes(byteBuffer);
+        byteBuffer.flip();

Review comment:
       Why do we need a flip here?

##########
File path: java/core/src/java/org/apache/orc/DigestConf.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc;
+
+import com.tdunning.math.stats.TDigest;
+
+public class DigestConf {
+
+  public static final DigestConf NO_CREATE_DIGEST = new DigestConf(false, 0);
+
+  public static final DigestConf DEFAULT_DIGEST = new DigestConf(true, 1000);
+
+  private final boolean createDigest;
+
+  private final double digestCompression;

Review comment:
       What does it mean? Does t-digest supports compression natively (using standard compression algorithm) or simple serialization?

##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       I agree with @dongjoon-hyun that introducing a new dependency adds more complexity to this project. It looks like t-digest has been widely adopted by many projects and big name companies: https://www.sciencedirect.com/science/article/pii/S2665963820300403. It would be good if we can make it as a plugin instead of a direct dependency.

##########
File path: java/core/pom.xml
##########
@@ -75,6 +75,10 @@
       <groupId>org.threeten</groupId>
       <artifactId>threeten-extra</artifactId>
     </dependency>
+    <dependency>
+      <groupId>com.tdunning</groupId>

Review comment:
       @guiyanakuang I am also curious of your preference of t-digest over DataSketches which is also an Apache project shipped with similar feature.

##########
File path: proto/orc_proto.proto
##########
@@ -26,12 +26,14 @@ message IntegerStatistics  {
   optional sint64 minimum = 1;
   optional sint64 maximum = 2;
   optional sint64 sum = 3;
+  optional bytes digest = 4;
 }
 
 message DoubleStatistics {
   optional double minimum = 1;
   optional double maximum = 2;
   optional double sum = 3;
+  optional bytes digest = 4;

Review comment:
       Does t-digest has breaking compatibility of serialization among different versions? Additionally, dealing with serialized t-digests between Java and C++ is also something we should be aware of.

##########
File path: java/core/src/test/org/apache/orc/impl/TestDigest.java
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to Ted Dunning under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.orc.ColumnStatistics;
+import org.apache.orc.OrcConf;
+import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.Writer;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestDigest {
+
+  Path workDir = new Path(System.getProperty("test.tmp.dir"));
+  Configuration conf;
+  FileSystem fs;
+  Path testFilePath;
+  TypeDescription schema;
+
+  @BeforeEach
+  public void openFileSystem() throws Exception {
+    conf = new Configuration();
+    conf.set(OrcConf.DIGEST_COLUMNS.getAttribute(), "id:100");
+    fs = FileSystem.getLocal(conf);
+    fs.setWorkingDirectory(workDir);
+    testFilePath = new Path("testWriterImpl.orc");
+    fs.create(testFilePath, true);
+    schema = TypeDescription.fromString("struct<id:int,name:string>");
+  }
+
+  @AfterEach
+  public void deleteTestFile() throws Exception {
+    fs.delete(testFilePath, false);
+  }
+
+  private void write() throws IOException {
+    Writer writer = OrcFile.createWriter(testFilePath,
+        OrcFile.writerOptions(conf)
+            .overwrite(true)
+            .setSchema(schema)
+    );
+    VectorizedRowBatch batch = schema.createRowBatch();
+    LongColumnVector id = (LongColumnVector) batch.cols[0];
+    BytesColumnVector name = (BytesColumnVector) batch.cols[1];
+    for (int r = 1; r < 20; ++r) {
+      int row = batch.size++;
+      id.vector[row] = r;
+      byte[] buffer = String.valueOf(r).getBytes(StandardCharsets.UTF_8);
+      name.setRef(row, buffer, 0, buffer.length);
+    }
+    int row = batch.size++;
+    id.vector[row] = 1_000_000;
+    byte[] buffer = String.valueOf(1_000_000).getBytes(StandardCharsets.UTF_8);
+    name.setRef(row, buffer, 0, buffer.length);
+    writer.addRowBatch(batch);
+    writer.close();
+  }
+
+  private void read() throws IOException {

Review comment:
       Should we encrypt the digest if column encryption is enabled to avoid data leak?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org