Posted to issues@orc.apache.org by "Gang Wu (JIRA)" <ji...@apache.org> on 2018/10/09 19:02:00 UTC

[jira] [Assigned] (ORC-415) [C++] Fix writing ColumnStatistics

     [ https://issues.apache.org/jira/browse/ORC-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Wu reassigned ORC-415:
---------------------------


> [C++] Fix writing ColumnStatistics
> ----------------------------------
>
>                 Key: ORC-415
>                 URL: https://issues.apache.org/jira/browse/ORC-415
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> The current C++ ORC writer implementation has two issues with column statistics.
> 1. A new batch may overwrite the previous batch's has_null info in colIndexStatistics if the new batch has no nulls but the previous batch has at least one null value.
> {code:java}
> bool hasNull = false;
> if (!structBatch->hasNulls) {
>   colIndexStatistics->increase(numValues);
> } else {
>   const char* notNull = structBatch->notNull.data() + offset;
>   for (uint64_t i = 0; i < numValues; ++i) {
>     if (notNull[i]) {
>       colIndexStatistics->increase(1);
>     } else if (!hasNull) {
>       hasNull = true;
>     }
>   }
> }
> colIndexStatistics->setHasNull(hasNull);{code}
> 2. If a ColumnStatistics has no non-null data, it has no sum/min/max info, so the writer serializes generic rather than type-specific ColumnStatistics to protobuf. As a result, the reader has a hard time deserializing the ColumnStatistics correctly.
> {code:java}
> void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
>   pbStats.set_hasnull(_stats.hasNull());
>   pbStats.set_numberofvalues(_stats.getNumberOfValues());
>   if (_stats.hasMinimum()) {
>     proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
>     dateStatistics->set_maximum(_stats.getMaximum());
>     dateStatistics->set_minimum(_stats.getMinimum());
>   }
> }
> {code}
>  
> The scope of this Jira is to fix these two problems.

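For the first issue, one possible shape of a fix is to only ever raise the has_null flag and never reset it from a later batch. The sketch below is not the actual ORC patch: SketchColumnStats, SketchBatch and addBatch are hypothetical stand-ins that model only the members used in the first snippet of the description.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's per-column statistics object;
// only the members used in the issue's snippet are modelled.
struct SketchColumnStats {
  uint64_t numberOfValues = 0;
  bool hasNull = false;

  void increase(uint64_t n) { numberOfValues += n; }
  void setHasNull(bool value) { hasNull = value; }
};

// Hypothetical stand-in for a column vector batch.
struct SketchBatch {
  bool hasNulls = false;
  std::vector<char> notNull;  // 1 = value present, 0 = null
};

// Fold one batch into the statistics without clearing a hasNull flag
// that an earlier batch has already set.
void addBatch(SketchColumnStats& stats, const SketchBatch& batch,
              uint64_t offset, uint64_t numValues) {
  if (!batch.hasNulls) {
    // No nulls in this batch: count every value, leave hasNull untouched.
    stats.increase(numValues);
    return;
  }
  assert(offset + numValues <= batch.notNull.size());
  const char* notNull = batch.notNull.data() + offset;
  bool sawNull = false;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      stats.increase(1);
    } else {
      sawNull = true;
    }
  }
  // Only raise the flag; never write false over an earlier true.
  if (sawNull) {
    stats.setHasNull(true);
  }
}
```

With this shape, a batch containing a null followed by a batch without nulls leaves hasNull set, which is the behaviour the description asks for.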

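For the second issue, one option is to always create the type-specific sub-message and fill in min/max only when they exist, so a reader can still tell which statistics type was written. The sketch below models this with plain structs rather than the generated protobuf classes. All Sketch* names and toProtoBufSketch are hypothetical, and the has* flags only imitate protobuf presence tracking; this is an illustration of the idea, not the actual fix.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the DateStatistics message.
struct SketchDateStatisticsPb {
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Hypothetical, simplified stand-in for the ColumnStatistics message.
struct SketchColumnStatisticsPb {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;  // false = generic statistics only
  SketchDateStatisticsPb dateStatistics;
};

// Hypothetical in-memory date statistics kept by the writer.
struct SketchDateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

SketchColumnStatisticsPb toProtoBufSketch(const SketchDateStats& stats) {
  // Assumption: min and max are always set together.
  assert(stats.hasMinimum == stats.hasMaximum);

  SketchColumnStatisticsPb pb;
  pb.hasNull = stats.hasNull;
  pb.numberOfValues = stats.numberOfValues;

  // Always mark the type-specific sub-message as present, even when
  // the column held no non-null values.
  pb.hasDateStatistics = true;
  if (stats.hasMinimum) {
    pb.dateStatistics.hasMinimum = true;
    pb.dateStatistics.hasMaximum = true;
    pb.dateStatistics.minimum = stats.minimum;
    pb.dateStatistics.maximum = stats.maximum;
  }
  return pb;
}
```

In the sketch, an all-null column still yields a message with date statistics present but without min/max, instead of a generic message with no type information.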

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)