You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Syed Shameerur Rahman (Jira)" <ji...@apache.org> on 2020/08/07 13:02:00 UTC

[jira] [Comment Edited] (HIVE-18284) NPE when inserting data with 'distribute by' clause

    [ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173124#comment-17173124 ] 

Syed Shameerur Rahman edited comment on HIVE-18284 at 8/7/20, 1:01 PM:
-----------------------------------------------------------------------

Steps to reproduce:


{code:java}
create table table1 (col1 string, datekey int);
insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
create table table2 (col1 string) partitioned by (datekey int);

set hive.vectorized.execution.enabled=false;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table2
PARTITION(datekey)
select col1,
datekey
from table1
distribute by datekey ;
{code}

when *hive.optimize.sort.dynamic.partition=true* we expect the keys to be sorted in the reducer side so that reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers (HIVE-6455) , But from the query execution it looks like the property of *distribute by clause* (does not guarantee clustering or sorting properties on the distributed keys) takes precedence and hence the keys at the reducer side are not in sorted order and fails with null pointer exception.

[~prasanth_j] [~jcamachorodriguez] [~vikram.dixit] any thoughts on this?
*Note*: This issue is not seen when vectorization is enabled



was (Author: srahman):
Steps to reproduce:


{code:java}
create table table1 (col1 string, datekey int);
insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
create table table2 (col1 string) partitioned by (datekey int);

set hive.vectorized.execution.enabled=false;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table2
PARTITION(datekey)
select col1,
datekey
from table1
distribute by datekey ;
{code}

when * hive.optimize.sort.dynamic.partition=true* we expect the keys to be sorted in the reducer side so that reducers can keep only one record writer open at any time thereby reducing the memory pressure on the reducers (HIVE-6455) , But from the query execution it looks like the property of *distribute by clause* (does not guarantee clustering or sorting properties on the distributed keys) takes precedence and hence the keys at the reducer side are not in sorted order and fails with null pointer exception.

[~prasanth_j] [~jcamachorodriguez] [~vikram.dixit] any thoughts on this?
*Note*: This issue is not seen when vectorization is enabled


> NPE when inserting data with 'distribute by' clause
> ---------------------------------------------------
>
>                 Key: HIVE-18284
>                 URL: https://issues.apache.org/jira/browse/HIVE-18284
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 2.3.1, 2.3.2
>         Environment: EMR
>            Reporter: Aki Tanaka
>            Assignee: Lynch Lee
>            Priority: Major
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' clause. The following snippet query reproduces this issue:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By  or use Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp. FileSinkOperator will remove the previous fsp which might be re-used when we use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_000000_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
> 	... 14 more
> Caused by: java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
> 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
> 	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
> 	... 17 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)