Posted to dev@hive.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2014/10/18 00:59:35 UTC

[jira] [Updated] (HIVE-8474) Vectorized reads of transactional tables fail when not all columns are selected

     [ https://issues.apache.org/jira/browse/HIVE-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated HIVE-8474:
-----------------------------
    Attachment: HIVE-8474.patch

This patch makes several changes in vectorization.  [~mmccline] and [~ashutoshc], as I am not very familiar with this code, and as I know it is very performance sensitive, I would appreciate your feedback on the patch.

The issue causing problems was that VectorizedBatchUtil.addRowToBatchFrom is used by VectorizedOrcAcidRowReader to take the merged rows from an acid read and put them in a vector batch.  But this method appears to have been built for use by vector operators, not file formats, where columns may be missing because they have been projected out or may already have values set because they are partition columns.  So I made the following changes:
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that are null.  This handles the case where columns have been projected out and thus the ColumnVector is null.
# I changed VectorizedRowBatch to have a boolean array that tracks which columns are partition columns, and VectorizedRowBatchCtx.createVectorizedRowBatch to populate this array.
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that are marked in VectorizedRowBatch as partition columns, since writing them would overwrite the values already put there by VectorizedRowBatchCtx.addPartitionColsToBatch (see the sketch after this list).
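
To make the intent concrete, here is a minimal sketch of the two guards (changes 1 and 3 above).  The loop shape and the array name partitionCols are illustrative placeholders, not the exact code in the patch:

{code}
// Sketch of the guards added inside the per-column loop of
// VectorizedBatchUtil.addRowToBatchFrom (names are illustrative).
for (int i = 0; i < fields.length; i++) {
  ColumnVector colVector = batch.cols[i];
  if (colVector == null) {
    // Change 1: the column was projected out, so no ColumnVector was
    // allocated for it; skip it instead of dereferencing null.
    continue;
  }
  if (batch.partitionCols[i]) {
    // Change 3: the partition column was already populated by
    // VectorizedRowBatchCtx.addPartitionColsToBatch; don't overwrite it.
    continue;
  }
  // ... existing per-type code that copies the row's field i into colVector ...
}
{code}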

My concern is whether it is appropriate to mix this functionality for skipping projected-out and partition columns into addRowToBatchFrom.  If you think it isn't a good fit, I can write a new method to do this, but that will involve a fair amount of duplicated code.

[~owen.omalley], I also changed VectorizedOrcAcidRowReader to set the partition column values in next after every call to VectorizedRowBatch.reset.  Without doing this, the code was NPEing later in the pipeline because the partition columns had been set to null.  It appeared that you had copied the code from VectorizedOrcInputFormat, which only calls addPartitionColsToBatch once, but which never calls reset.  I tried removing the call to reset, but that caused other issues.
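
Roughly, the fix in next looks like this.  This is a hedged sketch, assuming the reader holds a VectorizedRowBatchCtx; the field name rowBatchCtx is illustrative, not the name in the patch:

{code}
// Sketch (not the patch itself) of VectorizedOrcAcidRowReader.next():
// reset() clears what addPartitionColsToBatch wrote on the previous
// call, so the partition columns must be re-populated every time.
public boolean next(NullWritable key, VectorizedRowBatch batch) throws IOException {
  batch.reset();
  // Re-set the partition column values on every call; doing it once, as
  // VectorizedOrcInputFormat does, only works because that code path
  // never calls reset().
  rowBatchCtx.addPartitionColsToBatch(batch);  // rowBatchCtx is an illustrative field name
  // ... merge the acid rows into the batch via
  //     VectorizedBatchUtil.addRowToBatch, as before ...
  return batch.size > 0;
}
{code}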

> Vectorized reads of transactional tables fail when not all columns are selected
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-8474
>                 URL: https://issues.apache.org/jira/browse/HIVE-8474
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions, Vectorization
>    Affects Versions: 0.14.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8474.patch
>
>
> {code}
> create table concur_orc_tab(name varchar(50), age int, gpa decimal(3, 2)) clustered by (age) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
> select name, age from concur_orc_tab order by name;
> {code}
> results in
> {code}
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>         at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
>         at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>         at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setNullColIsNullValue(VectorizedBatchUtil.java:63)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:443)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatch(VectorizedBatchUtil.java:214)
>         at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:95)
>         at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:43)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
>         ... 13 more
> {code}
> The issue is that the object inspector passed to VectorizedOrcAcidRowReader has all of the columns in the file rather than only the projected columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)