You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@orc.apache.org by do...@apache.org on 2022/01/10 18:33:30 UTC

[orc] branch branch-1.7 updated: ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-1.7
in repository https://gitbox.apache.org/repos/asf/orc.git


The following commit(s) were added to refs/heads/branch-1.7 by this push:
     new 95d7f08  ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)
95d7f08 is described below

commit 95d7f080d7cd35846788c0060e3f142e18f76bdf
Author: Yiqun Zhang <gu...@gmail.com>
AuthorDate: Tue Jan 11 02:29:31 2022 +0800

    ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)
    
    ### What changes were proposed in this pull request?
    
    This pr is aims to fixing code in the FileDump and JsonFileDump commands that determines whether a column statistic exists or not.
    
    ### Why are the changes needed?
    
    Before we can get the ColumnStatistics from RowIndex we need to determine whether it exists or not.
    
    `entry.getStatistics()` does not return null at any time, and will return a default object when it is not set.
    
    The current impl is wrong
    ```java
          OrcProto.ColumnStatistics colStats = entry.getStatistics();
          if (colStats == null) {
            buf.append("no stats at ");
          }
    ```
    
    ### How was this patch tested?
    
    This can be tested with the following code
    
    ```java
      @Test
      public void test() {
        OrcProto.RowIndexEntry entry = OrcProto.RowIndexEntry.newBuilder().addPositions(0).build();
        assertFalse(entry.hasStatistics());
        assertNotNull(entry.getStatistics());
      }
    ```
    But since this is just testing Protobuf generated code, I'm not sure if I should add it to the unit tests.
    
    On the other hand, our proto is defined as follows, statistics is optional
    
    https://github.com/apache/orc/blob/7d2838b4b38111bc1bfa00592d84a8b36286b9f2/proto/orc_proto.proto#L102-L105
    
    Our orc writer implementation, however, will definitely write statistics, and there is no optional configuration, so it cannot be verified by the files written by orc
    
    https://github.com/apache/orc/blob/7d2838b4b38111bc1bfa00592d84a8b36286b9f2/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L322-L330
    (cherry picked from commit 043a827783c2d21ebf7f28355f66f0bb97db0834)
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
 java/tools/src/java/org/apache/orc/tools/FileDump.java     | 4 ++--
 java/tools/src/java/org/apache/orc/tools/JsonFileDump.java | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/java/tools/src/java/org/apache/orc/tools/FileDump.java b/java/tools/src/java/org/apache/orc/tools/FileDump.java
index 92d0562..0030fcb 100644
--- a/java/tools/src/java/org/apache/orc/tools/FileDump.java
+++ b/java/tools/src/java/org/apache/orc/tools/FileDump.java
@@ -747,10 +747,10 @@ public final class FileDump {
         buf.append("unknown\n");
         continue;
       }
-      OrcProto.ColumnStatistics colStats = entry.getStatistics();
-      if (colStats == null) {
+      if (!entry.hasStatistics()) {
         buf.append("no stats at ");
       } else {
+        OrcProto.ColumnStatistics colStats = entry.getStatistics();
         ColumnStatistics cs =
             ColumnStatisticsImpl.deserialize(colSchema, colStats,
                 reader.writerUsedProlepticGregorian(),
diff --git a/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java b/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
index e636d86..b05aff2 100644
--- a/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
+++ b/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
@@ -456,7 +456,7 @@ public class JsonFileDump {
       writer.object();
       writer.key("entryId").value(entryIx);
       OrcProto.RowIndexEntry entry = index.getEntry(entryIx);
-      if (entry == null) {
+      if (entry == null || !entry.hasStatistics()) {
         continue;
       }
       OrcProto.ColumnStatistics colStats = entry.getStatistics();