Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/28 20:56:12 UTC

[GitHub] [iceberg] tprelle opened a new issue #2541: Hive: insert into from hive tez it's not working for simple insert query

tprelle opened a new issue #2541:
URL: https://github.com/apache/iceberg/issues/2541


   INSERT INTO from Hive on Tez does not work for a simple insert query: the forCommit file is missing because it is only created for reduce tasks.
   https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java#L82
   Just running insert into export_table select * from import_table; reproduces the bug.
   `
   ----------------------------------------------------------------------------------------------
           VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
   ----------------------------------------------------------------------------------------------
   Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
   ----------------------------------------------------------------------------------------------
   VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 28.92 s    
   ----------------------------------------------------------------------------------------------
   ERROR : Commit failed for output: outputName:out_Map 1 of vertex/vertexGroup:Map 1 isVertexGroupOutput:false, org.apache.iceberg.exceptions.NotFoundException: Failed to open input stream for file: hdfs://.../table/temp/hive_20210428203410_e66e7c0d-64d8-4266-9d96-8512a097ade2-job_16191695629360_106820/task-1.forCommit
           at org.apache.iceberg.hadoop.HadoopInputFile.newStream(HadoopInputFile.java:177)
           at org.apache.iceberg.mr.hive.HiveIcebergOutputCommitter.readFileForCommit(HiveIcebergOutputCommitter.java:439)
           at org.apache.iceberg.mr.hive.HiveIcebergOutputCommitter.lambda$dataFiles$9(HiveIcebergOutputCommitter.java:394)
           at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
           at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:70)
           at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:310)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] tprelle edited a comment on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle edited a comment on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-831888173


   @marton-bod sure:
   For Tez, starting from the Apache 0.10.0 tag, I added:
   - https://issues.apache.org/jira/projects/TEZ/issues/TEZ-4238
   - https://issues.apache.org/jira/projects/TEZ/issues/TEZ-4264
   
   For Hive it was a bit more complex. Starting from HDP 3.1.5-2-4, I added:
    - https://issues.apache.org/jira/browse/HIVE-23190 to be able to move to Tez 0.10
    - https://issues.apache.org/jira/browse/HIVE-24629 for the output committer class
    - https://issues.apache.org/jira/browse/HIVE-24207 because I need the Hive Tez processor to fill the JobConf https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java#L202 with TEZ_VERTEX_ID_HIVE so that TaskAttemptWrapper works https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/TezUtil.java#L95
    
   With this version I still had an issue with this line https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java#L382
   because conf.getNumReduceTasks() and conf.getNumMapTasks() were never set by Hive.
   I found a way to fix it (but I do not know whether it is the correct one, or whether it is only needed because of the HDP fork).
   
   - For a ReduceWork plan, I added at this line
   
    https://github.com/apache/hive/blob/branch-3.1/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L800
   `    conf.setNumReduceTasks(reduceWork.isAutoReduceParallelism() ?
               reduceWork.getMaxReduceTasks() :
               reduceWork.getNumReduceTasks());`
   
   - For MergeJoinWork, I added at this line https://github.com/apache/hive/blob/branch-3.1/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L596 `conf.setNumMapTasks(mapWorkList.size() + 1);`
   
   - For MapWork, I could only make it work under one condition, hive.compute.splits.in.am=false, by adding at https://github.com/apache/hive/blob/branch-3.1/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L716 `conf.setNumMapTasks(numTasks);`
   But with hive.compute.splits.in.am=false, vectorization no longer works because row IDs are no longer projected.
   
   I need to set up a Hive build from the latest 3.1 version in order to test this.
   I use the Apache code as the example because Cloudera decided to remove the Hortonworks GitHub repositories from the internet, but it seems to be almost the same code as Apache branch-3.1.
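
To make the task-count problem above concrete, here is a minimal Java sketch. It is a simplified model, not the Iceberg source: the class `ForCommitCheck` and both of its methods are hypothetical, and the rule "use the reduce task count when it is positive, otherwise the map task count" is an assumption based on the two `conf` getters named above. The file names follow the `task-1.forCommit` pattern visible in the stack trace of the original report.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Simplified model of the expected-file check; not the HiveIcebergOutputCommitter source.
final class ForCommitCheck {

  // Assumption: reducers write the forCommit files when there are any, otherwise the mappers do.
  static int expectedTasks(int numReduceTasks, int numMapTasks) {
    return numReduceTasks > 0 ? numReduceTasks : numMapTasks;
  }

  // Lists the forCommit files the job commit would look for but that no task wrote.
  static List<String> missingFiles(int expectedTasks, Set<String> written) {
    List<String> missing = new ArrayList<>();
    for (int taskId = 0; taskId < expectedTasks; taskId++) {
      String name = "task-" + taskId + ".forCommit";
      if (!written.contains(name)) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Only one map task actually ran and wrote its file.
    Set<String> written = Collections.singleton("task-0.forCommit");
    // Hypothetical stale configuration: 0 reducers and 2 mappers, although only 1 mapper ran.
    System.out.println(missingFiles(expectedTasks(0, 2), written)); // prints [task-1.forCommit]
  }
}
```

If the configured count does not match the number of tasks that really ran, the job commit looks for a file that was never written, which is consistent with the NotFoundException in the original report.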







[GitHub] [iceberg] marton-bod commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829525786


   Yeah, there are some unreleased Tez patches that are also necessary to make it work, such as https://issues.apache.org/jira/browse/TEZ-4264, which helps append the missing vertex id to the job id in the committer. But there might be other missing pieces too. The bottom line is that Tez writes are unsupported at the moment and are fully expected to not work, but of course, feel free to keep tinkering and please let us know if you have more questions. Thanks!




[GitHub] [iceberg] tprelle commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-831857009


   Hi @marton-bod,
   I found my issue: it comes from Hive, not from Iceberg.
   For map-only jobs, I need to set hive.compute.splits.in.am=false in order to get the right number from getNumMapTasks.
   I do not know whether this comes from my version of Hive, which is derived from the HDP version and not directly from the Apache version.
   But with this, and the other Tez and Hive patches, it works.
   Thanks for the help







[GitHub] [iceberg] tprelle commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829310031


   It may come from my version of Hive, because in theory the code looks good.
   




[GitHub] [iceberg] marton-bod commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-828785994


   @tprelle Hive writes using Tez are currently unsupported and may contain several problems. Our current [hive docs](https://github.com/apache/iceberg/blob/master/site/docs/hive.md#hive-query-engines) mention support for both mr and tez, but it's misleading because that's currently only for hive reads, not writes - we'll make changes to make that clearer.
   
   Which version of Hive are you using? In the meantime, can you switch to mr engine and rerun your queries and see if the issue persists?




[GitHub] [iceberg] tprelle commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829475652


   Hi, I already fixed this by merging HIVE-24629, but I think I found a lead.
   I have a log message _CommitTask found no writer for specific table: _table_, attemptID: attempt_16196747818930_9685_r_000000_3_; I do not know why yet.




[GitHub] [iceberg] marton-bod commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-831859693


   Thanks for the feedback @tprelle! Glad you managed to make it work :)
   Can I ask which iceberg-related patches you ended up applying on top of your vanilla Hive 3.1 installation, which made Tez writes work eventually?




[GitHub] [iceberg] tprelle commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829522385


   I think this comes from my Hive and/or Tez version rather than from Iceberg. It seems that the taskAttemptID used when creating the writer in HiveIcebergRecordWriter is not the same as the one in the output committer, so the lookup does not work.
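
The lookup failure described above can be illustrated with a small Java sketch. Everything in it is hypothetical: `WriterRegistry` is a toy stand-in for a writer registry keyed by attempt ID, the IDs are made up, and plain strings are used instead of real attempt ID objects. It only shows why a writer registered under one ID cannot be found under a slightly different one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy registry keyed by attempt ID; not the HiveIcebergRecordWriter source.
final class WriterRegistry {

  // Hypothetical mapping: attempt ID string -> description of the writer created by that attempt.
  private static final Map<String, String> WRITERS = new ConcurrentHashMap<>();

  static void register(String attemptId, String writer) {
    WRITERS.put(attemptId, writer);
  }

  // Returns null when no writer was registered under exactly this ID.
  static String lookup(String attemptId) {
    return WRITERS.get(attemptId);
  }

  public static void main(String[] args) {
    // Made-up IDs: the writer side carries an extra vertex component, the committer side does not.
    register("attempt_1_0001_1_m_000000_0", "writer for export_table");
    System.out.println(lookup("attempt_1_0001_1_m_000000_0")); // same ID: writer is found
    System.out.println(lookup("attempt_1_0001_m_000000_0"));   // different ID: prints null
  }
}
```

In this model the committer can only find the writer if both sides build the ID in exactly the same way.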




[GitHub] [iceberg] tprelle closed issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle closed issue #2541:
URL: https://github.com/apache/iceberg/issues/2541


   




[GitHub] [iceberg] marton-bod commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-831891034


   I see, thanks a lot for the detailed description!




[GitHub] [iceberg] tprelle commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829227389


   Hi @marton-bod,
   OK, I just wanted to test whether the Hive and Tez patches were working.
   It's a custom Hive 3.1 version based on HDP 3.1.5, to which we have applied multiple patches for different issues, and Tez 0.10 with https://issues.apache.org/jira/browse/TEZ-4248 and https://issues.apache.org/jira/browse/TEZ-4264. So no MR.
   I will check whether I can find where the issue is and how to fix it.
   I was just checking whether I am missing a Tez or Hive patch, but I think it may be the Iceberg output committer.




[GitHub] [iceberg] marton-bod commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829319321


   Hi @tprelle, Most likely the issue you observed is because in Hive 3.1, task commits for Tez are not yet implemented. It was recently added as a feature: https://issues.apache.org/jira/browse/HIVE-24629, but not yet released. So quite simply, while the `commitTask` logic is there in the `HiveIcebergOutputCommitter`, it's not actually invoked by Hive or Tez at any point. Although it's no consolation for you, this is not a problem for MR, where the invocation is in place.
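
As a rough illustration of the point above, the following Java sketch models a two-phase commit in which the job commit depends on every task commit having run. It is a toy model under stated assumptions: `CommitProtocolSketch`, its nested `Committer`, and `runJob` are hypothetical names, and the `engineCallsCommitTask` flag stands in for whether the execution engine invokes `commitTask` at all.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the two-phase commit discussed above; not the HiveIcebergOutputCommitter source.
final class CommitProtocolSketch {

  // Hypothetical committer: commitTask records a task's output, commitJob requires all of them.
  static final class Committer {
    private final Set<Integer> committedTasks = new HashSet<>();

    void commitTask(int taskId) {
      committedTasks.add(taskId);
    }

    void commitJob(int numTasks) {
      for (int taskId = 0; taskId < numTasks; taskId++) {
        if (!committedTasks.contains(taskId)) {
          throw new IllegalStateException("No committed output for task " + taskId);
        }
      }
    }
  }

  // Runs a job and reports whether commitJob succeeded.
  static boolean runJob(int numTasks, boolean engineCallsCommitTask) {
    Committer committer = new Committer();
    if (engineCallsCommitTask) {
      for (int taskId = 0; taskId < numTasks; taskId++) {
        committer.commitTask(taskId);
      }
    }
    try {
      committer.commitJob(numTasks);
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(runJob(1, true));  // engine invokes commitTask: prints true
    System.out.println(runJob(1, false)); // engine never invokes commitTask: prints false
  }
}
```

The second call fails even though the committer logic itself is unchanged, which mirrors the situation where the logic exists but is never invoked.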




[GitHub] [iceberg] tprelle commented on issue #2541: Hive: insert into from hive tez it's not working for Map Only insert query

Posted by GitBox <gi...@apache.org>.
tprelle commented on issue #2541:
URL: https://github.com/apache/iceberg/issues/2541#issuecomment-829539609


   @marton-bod I may be missing this PR https://github.com/apache/hive/pull/2161 in my Hive build.

