Posted to dev@hive.apache.org by "Mithun Radhakrishnan (JIRA)" <ji...@apache.org> on 2017/01/10 23:43:58 UTC
[jira] [Created] (HIVE-15575) ALTER TABLE CONCATENATE and hive.merge.tezfiles seem busted for UNION ALL output
Mithun Radhakrishnan created HIVE-15575:
-------------------------------------------
Summary: ALTER TABLE CONCATENATE and hive.merge.tezfiles seem busted for UNION ALL output
Key: HIVE-15575
URL: https://issues.apache.org/jira/browse/HIVE-15575
Project: Hive
Issue Type: Bug
Reporter: Mithun Radhakrishnan
Priority: Critical
Hive {{UNION ALL}} writes its output into sub-directories under the table/partition directories. For example:
{noformat}
hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, goo string ) stored as textfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, bar string, goo string ) partitioned by ( dt string ) stored as orcfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false; insert overwrite table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from source;
...
Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=null)
Time taken for load dynamic partitions : 311
Loading partition {dt=1}
Time taken for adding to write entity : 3
OK
Time taken: 27.659 seconds
hive (mythdb_hadooppf_17544)> dfs -ls -R /tmp/mythdb_hadooppf_17544/results_partitioned;
drwxrwxrwt - dfsload hdfs 0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
drwxrwxrwt - dfsload hdfs 0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
-rwxrwxrwt 3 dfsload hdfs 349 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/000000_0
drwxrwxrwt - dfsload hdfs 0 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
-rwxrwxrwt 3 dfsload hdfs 368 2017-01-10 23:13 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/000000_0
{noformat}
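To make the consequence of that layout concrete (this is an illustration, not Hive code): with a non-recursive directory listing, the default behaviour of Hadoop's {{FileInputFormat}}, a reader pointed at {{dt=1}} sees no files at all, because both output files sit one level deeper. A small Python sketch of the two listing modes:

```python
import os
import tempfile

def list_inputs(root, recursive=False):
    """Mimic FileInputFormat-style listing: with recursive=False (the
    Hadoop default), only files that are direct children of the input
    directory are returned; files inside sub-directories are skipped."""
    found = []
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_file():
            found.append(entry.path)
        elif entry.is_dir() and recursive:
            found.extend(list_inputs(entry.path, recursive=True))
    return found

# Recreate the layout shown above: dt=1/1/000000_0 and dt=1/2/000000_0,
# i.e. no files directly under the partition directory itself.
root = tempfile.mkdtemp()
for branch in ("1", "2"):
    os.makedirs(os.path.join(root, branch))
    with open(os.path.join(root, branch, "000000_0"), "w") as f:
        f.write("data")

print(len(list_inputs(root)))                  # non-recursive: 0 files found
print(len(list_inputs(root, recursive=True)))  # recursive: 2 files found
```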
These results can only be read with {{mapred.input.dir.recursive=true}}, which {{TezCompiler::init()}} appears to set. But the Hadoop default for this setting is {{false}}, which leads to the following errors:
1. Running {{CONCATENATE}} on the partition causes data loss.
{noformat}
hive --database mythdb_hadooppf_17544 -e " set mapred.input.dir.recursive; alter table results_partitioned partition ( dt='1' ) concatenate ; set mapred.input.dir.recursive; "
...
OK
Time taken: 2.151 seconds
mapred.input.dir.recursive=false
Status: Running (Executing on YARN cluster with App id application_1481756273279_5088754)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
File Merge SUCCEEDED 0 0 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [>>--------------------------] 0% ELAPSED TIME: 0.35 s
--------------------------------------------------------------------------------
Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=1)
Moved: 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1' to trash at: hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
Moved: 'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2' to trash at: hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
OK
Time taken: 25.873 seconds
$ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
1 0 0 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
{noformat}
2. {{hive.merge.tezfiles}} is busted, because the merge task attempts to merge files across {{results_partitioned/dt=1/1}} and {{results_partitioned/dt=1/2}}:
{noformat}
$ hive --database mythdb_hadooppf_17544 -e " set hive.merge.tezfiles=true; insert overwrite table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from source; "
...
Query ID = dfsload_20170110233558_51289333-d9da-4851-8671-bfe653d26e45
Total jobs = 3
Launching Job 1 out of 3
Status: Running (Executing on YARN cluster with App id application_1481756273279_5089989)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Map 3 .......... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 13.07 s
--------------------------------------------------------------------------------
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Status: Running (Executing on YARN cluster with App id application_1481756273279_5089989)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
File Merge RUNNING 1 0 1 0 2 0
--------------------------------------------------------------------------------
VERTICES: 00/01 [>>--------------------------] 0% ELAPSED TIME: 3.06 s
--------------------------------------------------------------------------------
...
{noformat}
The {{File Merge}} fails with the following:
{noformat}
TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:192)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:184)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:184)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:180)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:217)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:151)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:159)
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:62)
at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:208)
... 16 more
Caused by: java.io.IOException: Multiple partitions for one merge mapper: hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2 NOT EQUAL TO hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.checkPartitionsMatch(AbstractFileMergeOperator.java:174)
at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.fixTmpPath(AbstractFileMergeOperator.java:191)
at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:86)
... 18 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1481756273279_5089989_2_00 [File Merge] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
{noformat}
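The failure originates in {{AbstractFileMergeOperator.checkPartitionsMatch()}} (see the stack trace above), which requires every file handed to one merge mapper to share the same immediate parent directory. A hypothetical Python rendering of that invariant (the real logic is Java, inside Hive) shows why the {{UNION ALL}} sub-directories trip it:

```python
import os

class MultiplePartitionsError(IOError):
    pass

def check_partitions_match(paths):
    """Sketch of the invariant enforced by the merge mapper: all input
    files for a single mapper must come from one directory. Two distinct
    parent directories are treated as two partitions, which is fatal."""
    parents = {os.path.dirname(p) for p in paths}
    if len(parents) > 1:
        first, second = sorted(parents)
        raise MultiplePartitionsError(
            "Multiple partitions for one merge mapper: "
            f"{second} NOT EQUAL TO {first}")

# Hypothetical paths modelled on the layout above: the UNION ALL branches
# put files under dt=1/1 and dt=1/2, and both land in one merge mapper.
inputs = ["/warehouse/results_partitioned/dt=1/1/000000_0",
          "/warehouse/results_partitioned/dt=1/2/000000_0"]
try:
    check_partitions_match(inputs)
except MultiplePartitionsError as e:
    print("merge fails:", e)
```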
3. Data produced by Hive {{UNION ALL}} will not be readable from Pig/HCatalog without {{mapred.input.dir.recursive=true}}.
Setting {{mapred.input.dir.recursive=true}} in {{hive-site.xml}} should resolve the first and third problems. But is that the recommendation? It is intrusive, and it doesn't solve #2. (As per my limited understanding, Pig's {{UNION}} doesn't produce sub-directories this way.)
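For reference, that site-wide workaround would amount to the following {{hive-site.xml}} fragment (shown only to make the proposal concrete; as noted, it helps #1 and #3 but not #2):
{noformat}
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
{noformat}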
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)