Posted to issues@orc.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/10/09 19:04:00 UTC

[jira] [Commented] (ORC-415) [C++] Fix writing ColumnStatistics

    [ https://issues.apache.org/jira/browse/ORC-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643955#comment-16643955 ] 

ASF GitHub Bot commented on ORC-415:
------------------------------------

wgtmac opened a new pull request #319: ORC-415: [C++] Fix writing ColumnStatistics
URL: https://github.com/apache/orc/pull/319
 
 
   This fixes the two issues below:
   1. A new batch may overwrite the previous batch's has_null info in colIndexStatistics if the new batch has no nulls but the previous batch has at least one null value.
   2. If a ColumnStatistics has no non-null data, it has no sum/min/max info, so the protobuf serialization writes a generic ColumnStatistics.
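
A minimal sketch of how the first issue could be avoided, assuming a simplified stand-in for the index statistics. `IndexStats` and `addBatch` are hypothetical names for illustration, not the ORC writer's actual classes, and this is not necessarily how the pull request implements the fix. The idea is that the null flag is only ever set, never cleared, so a later batch without nulls cannot erase what an earlier batch recorded.

```cpp
#include <cstdint>

// Hypothetical stand-in for the writer's per-column index statistics.
struct IndexStats {
  uint64_t numValues = 0;
  bool hasNull = false;

  void increase(uint64_t n) { numValues += n; }
  void setHasNull(bool v) { hasNull = v; }
};

// Adds one batch to the statistics.
// notNull == nullptr means the batch contains no nulls at all.
void addBatch(IndexStats& stats, const char* notNull, uint64_t numValues) {
  if (notNull == nullptr) {
    stats.increase(numValues);
    return;  // leave hasNull exactly as earlier batches set it
  }

  bool sawNull = false;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[i]) {
      stats.increase(1);
    } else {
      sawNull = true;
    }
  }

  // Only ever set the flag; never write false over a previous true.
  if (sawNull) {
    stats.setHasNull(true);
  }
}
```

With this shape, a batch containing a null followed by a batch without nulls still leaves `hasNull` set to true.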



> [C++] Fix writing ColumnStatistics
> ----------------------------------
>
>                 Key: ORC-415
>                 URL: https://issues.apache.org/jira/browse/ORC-415
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> The current C++ ORC writer implementation has two issues with column statistics.
> 1. A new batch may overwrite the previous batch's has_null info in colIndexStatistics if the new batch has no nulls but the previous batch has at least one null value.
> {code:java}
> bool hasNull = false;
> if (!structBatch->hasNulls) {
>   colIndexStatistics->increase(numValues);
> } else {
>   const char* notNull = structBatch->notNull.data() + offset;
>   for (uint64_t i = 0; i < numValues; ++i) {
>     if (notNull[i]) {
>       colIndexStatistics->increase(1);
>     } else if (!hasNull) {
>       hasNull = true;
>     }
>   }
> }
> colIndexStatistics->setHasNull(hasNull);{code}
> 2. If a ColumnStatistics has no non-null data, it has no sum/min/max info, so the protobuf serialization writes a generic rather than a type-specific ColumnStatistics. As a result, the reader has a hard time deserializing the ColumnStatistics correctly.
> {code:java}
> void toProtoBuf(proto::ColumnStatistics& pbStats) const override {
>   pbStats.set_hasnull(_stats.hasNull());
>   pbStats.set_numberofvalues(_stats.getNumberOfValues());
>   if (_stats.hasMinimum()) {
>     proto::DateStatistics* dateStatistics = pbStats.mutable_datestatistics();
>     dateStatistics->set_maximum(_stats.getMaximum());
>     dateStatistics->set_minimum(_stats.getMinimum());
>   }
> }
> {code}
>  
> The scope of this Jira is to fix these two problems.
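
One plausible way to address the second problem is to always materialize the type-specific part of the message, even when there is no minimum or maximum to write. The sketch below uses plain structs as hypothetical stand-ins for the generated protobuf classes (`ColumnStatisticsMsg`, `DateStatisticsMsg`, `DateStats` and `toMessage` are illustrative names only), and it is not necessarily what the pull request does. `mutableDateStatistics()` imitates the behavior of protobuf's generated `mutable_*()` accessors, which mark a sub-message as present when called.

```cpp
#include <cstdint>

// Hypothetical plain-struct stand-in for the date statistics sub-message.
struct DateStatisticsMsg {
  bool hasMinimum = false;
  bool hasMaximum = false;
  int32_t minimum = 0;
  int32_t maximum = 0;
};

// Hypothetical stand-in for the serialized column statistics message.
struct ColumnStatisticsMsg {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasDateStatistics = false;  // presence of the type-specific part
  DateStatisticsMsg dateStatistics;

  // Imitates a protobuf mutable_*() accessor: marks the field as present.
  DateStatisticsMsg* mutableDateStatistics() {
    hasDateStatistics = true;
    return &dateStatistics;
  }
};

// Hypothetical in-memory date statistics collected by the writer.
struct DateStats {
  bool hasNull = false;
  uint64_t numberOfValues = 0;
  bool hasMinMax = false;  // false when no non-null value was seen
  int32_t minimum = 0;
  int32_t maximum = 0;
};

void toMessage(const DateStats& stats, ColumnStatisticsMsg& msg) {
  msg.hasNull = stats.hasNull;
  msg.numberOfValues = stats.numberOfValues;

  // Always create the type-specific part so a reader can tell which
  // statistics type it is looking at, even for an all-null column.
  DateStatisticsMsg* date = msg.mutableDateStatistics();
  if (stats.hasMinMax) {
    date->minimum = stats.minimum;
    date->hasMinimum = true;
    date->maximum = stats.maximum;
    date->hasMaximum = true;
  }
}
```

For an all-null column, the resulting message still carries an (empty) date statistics part, while the minimum and maximum stay unset.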


