You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@orc.apache.org by do...@apache.org on 2022/01/10 18:33:30 UTC
[orc] branch branch-1.7 updated: ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)
This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch branch-1.7
in repository https://gitbox.apache.org/repos/asf/orc.git
The following commit(s) were added to refs/heads/branch-1.7 by this push:
new 95d7f08 ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)
95d7f08 is described below
commit 95d7f080d7cd35846788c0060e3f142e18f76bdf
Author: Yiqun Zhang <gu...@gmail.com>
AuthorDate: Tue Jan 11 02:29:31 2022 +0800
ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)
### What changes were proposed in this pull request?
This pr is aims to fixing code in the FileDump and JsonFileDump commands that determines whether a column statistic exists or not.
### Why are the changes needed?
Before we can get the ColumnStatistics from RowIndex we need to determine whether it exists or not.
`entry.getStatistics()` does not return null at any time, and will return a default object when it is not set.
The current impl is wrong
```java
OrcProto.ColumnStatistics colStats = entry.getStatistics();
if (colStats == null) {
buf.append("no stats at ");
}
```
### How was this patch tested?
This can be tested with the following code
```java
@Test
public void test() {
OrcProto.RowIndexEntry entry = OrcProto.RowIndexEntry.newBuilder().addPositions(0).build();
assertFalse(entry.hasStatistics());
assertNotNull(entry.getStatistics());
}
```
But since this is just testing Protobuf generated code, I'm not sure if I should add it to the unit tests.
On the other hand, our proto is defined as follows, statistics is optional
https://github.com/apache/orc/blob/7d2838b4b38111bc1bfa00592d84a8b36286b9f2/proto/orc_proto.proto#L102-L105
Our orc writer implementation, however, will definitely write statistics, and there is no optional configuration, so it cannot be verified by the files written by orc
https://github.com/apache/orc/blob/7d2838b4b38111bc1bfa00592d84a8b36286b9f2/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L322-L330
(cherry picked from commit 043a827783c2d21ebf7f28355f66f0bb97db0834)
Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
java/tools/src/java/org/apache/orc/tools/FileDump.java | 4 ++--
java/tools/src/java/org/apache/orc/tools/JsonFileDump.java | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/java/tools/src/java/org/apache/orc/tools/FileDump.java b/java/tools/src/java/org/apache/orc/tools/FileDump.java
index 92d0562..0030fcb 100644
--- a/java/tools/src/java/org/apache/orc/tools/FileDump.java
+++ b/java/tools/src/java/org/apache/orc/tools/FileDump.java
@@ -747,10 +747,10 @@ public final class FileDump {
buf.append("unknown\n");
continue;
}
- OrcProto.ColumnStatistics colStats = entry.getStatistics();
- if (colStats == null) {
+ if (!entry.hasStatistics()) {
buf.append("no stats at ");
} else {
+ OrcProto.ColumnStatistics colStats = entry.getStatistics();
ColumnStatistics cs =
ColumnStatisticsImpl.deserialize(colSchema, colStats,
reader.writerUsedProlepticGregorian(),
diff --git a/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java b/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
index e636d86..b05aff2 100644
--- a/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
+++ b/java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
@@ -456,7 +456,7 @@ public class JsonFileDump {
writer.object();
writer.key("entryId").value(entryIx);
OrcProto.RowIndexEntry entry = index.getEntry(entryIx);
- if (entry == null) {
+ if (entry == null || !entry.hasStatistics()) {
continue;
}
OrcProto.ColumnStatistics colStats = entry.getStatistics();