Posted to issues@orc.apache.org by "Gang Wu (JIRA)" <ji...@apache.org> on 2018/10/09 19:02:00 UTC
[jira] [Assigned] (ORC-415) [C++] Fix writing ColumnStatistics
[ https://issues.apache.org/jira/browse/ORC-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gang Wu reassigned ORC-415:
---------------------------
> [C++] Fix writing ColumnStatistics
> ----------------------------------
>
> Key: ORC-415
> URL: https://issues.apache.org/jira/browse/ORC-415
> Project: ORC
> Issue Type: Bug
> Components: C++
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> The current C++ ORC writer implementation has two issues with column statistics.
> 1. A new batch may overwrite the has_null flag that colIndexStatistics recorded for a previous batch: if the previous batch had at least one null value and the new batch has none, the flag is reset to false.
> {code:java}
> bool hasNull = false;
> if (!structBatch->hasNulls) {
>   colIndexStatistics->increase(numValues);
> } else {
>   const char* notNull = structBatch->notNull.data() + offset;
>   for (uint64_t i = 0; i < numValues; ++i) {
>     if (notNull[i]) {
>       colIndexStatistics->increase(1);
>     } else if (!hasNull) {
>       hasNull = true;
>     }
>   }
> }
> colIndexStatistics->setHasNull(hasNull);
> {code}
> 2. If a ColumnStatistics has no non-null data, it has no sum/min/max information, so the protobuf serialization writes a generic ColumnStatistics rather than a type-specific one. As a result, the reader has a hard time deserializing the ColumnStatistics correctly.
> {code:java}
> void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
>   pbStats.set_hasnull(_stats.hasNull());
>   pbStats.set_numberofvalues(_stats.getNumberOfValues());
>   if (_stats.hasMinimum()) {
>     proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
>     dateStatistics->set_maximum(_stats.getMaximum());
>     dateStatistics->set_minimum(_stats.getMinimum());
>   }
> }
> {code}
>
> The scope of this Jira is to fix these two problems.
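
The two problems described above can be illustrated with a minimal, self-contained sketch. It is not the actual patch: SketchStats, SketchBatch, SketchProto, addBatch and toProto are hypothetical stand-ins for the writer's statistics object, a column batch and proto::ColumnStatistics, and the real ORC classes look different. The sketch assumes one possible approach: treat has_null as a sticky flag that is only ever raised, and always mark the type-specific sub-message as present, even when there is no min/max to store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's per-column statistics.
struct SketchStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinMax = false;
  int32_t minimum = 0;
  int32_t maximum = 0;

  void increase(uint64_t n) { numberOfValues += n; }

  void update(int32_t v) {
    if (!hasMinMax) {
      minimum = maximum = v;
      hasMinMax = true;
    } else {
      if (v < minimum) minimum = v;
      if (v > maximum) maximum = v;
    }
  }
};

// Hypothetical stand-in for a column batch. When hasNulls is true,
// notNull is assumed to have one entry per element of data.
struct SketchBatch {
  bool hasNulls = false;
  std::vector<char> notNull;
  std::vector<int32_t> data;
};

// Issue 1: the has-null flag is only ever raised, never cleared,
// so a later batch without nulls cannot erase an earlier null.
void addBatch(SketchStats& stats, const SketchBatch& batch) {
  for (std::size_t i = 0; i < batch.data.size(); ++i) {
    if (!batch.hasNulls || batch.notNull[i]) {
      stats.increase(1);
      stats.update(batch.data[i]);
    } else {
      stats.hasNull = true;
    }
  }
}

// Hypothetical stand-in for the serialized statistics message.
struct SketchProto {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;  // plays the role of the sub-message presence bit
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Issue 2: the type-specific part is marked present unconditionally;
// min/max are filled in only when non-null data was seen.
SketchProto toProto(const SketchStats& stats) {
  SketchProto pb;
  pb.hasNull = stats.hasNull;
  pb.numberOfValues = stats.numberOfValues;
  pb.hasDateStatistics = true;
  if (stats.hasMinMax) {
    pb.hasMinimum = pb.hasMaximum = true;
    pb.minimum = stats.minimum;
    pb.maximum = stats.maximum;
  }
  return pb;
}
```

With generated protobuf C++ code, calling a mutable_ accessor on a singular message field is enough to mark that sub-message as present, so hoisting the mutable_datestatistics() call out of the hasMinimum() check would have the effect sketched by hasDateStatistics here. Whether the actual fix takes that route is an assumption.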
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)