Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2020/09/05 08:40:00 UTC

[jira] [Comment Edited] (TEZ-4230) TestMmCompactorOnTez/TestCrudCompactorOnTez hangs when running against Tez 0.10.0 staging artifact

    [ https://issues.apache.org/jira/browse/TEZ-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191011#comment-17191011 ] 

László Bodor edited comment on TEZ-4230 at 9/5/20, 8:40 AM:
------------------------------------------------------------

I think this is caused by TEZ-3897, which seems to have introduced a race condition via [future.cancel(true)|https://github.com/apache/tez/commit/c34e46c73218bf21a0219f3004e20cbedaad92f4#diff-a1849ff607725cf1b84d74e78823ca3cR305]

In the Hive tests mentioned above, we can see hangs with both 0.9.2 and 0.10.0 (staging artifact), and the cause now seems clear to me based on [^TestCrudCompactorOnTez.log]

Somehow the task's heartbeat thread is interrupted while the AsyncDispatcher is handling the event, and the last log message before "AsyncDispatcher thread interrupted" is "Stopping containerId". So I suspect that LocalContainerLauncher cancels the task runnable and doesn't wait for the heartbeat to be fully processed.

cc: [~jeagles], [~jlowe], wondering if this makes sense to you. Before TEZ-3897, LocalContainerLauncher totally ignored the task callback on container stop; after TEZ-3897, "future.cancel(true)" seems to be quite strict under some circumstances. I'm going to test the flaky Hive test with [future.cancel(false)|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html?is-external=true#cancel-boolean-] instead.
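To illustrate the difference between the two cancel modes, here is a minimal standalone sketch using only java.util.concurrent. It is not Tez code: the class name CancelDemo, the method wasInterrupted, and the sleeping runnable (a stand-in for a heartbeat that is still being processed) are all made up for the example. It only shows that cancel(true) interrupts a task that is already running, while cancel(false) lets the running body finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of Tez or Hive.
class CancelDemo {

  /**
   * Starts a stand-in task, cancels it while it is running, and reports
   * whether the task body observed an interrupt.
   */
  static boolean wasInterrupted(boolean mayInterruptIfRunning) throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch finished = new CountDownLatch(1);
    AtomicBoolean interrupted = new AtomicBoolean(false);

    Future<?> future = executor.submit(() -> {
      started.countDown();
      try {
        // Stand-in for work that should not be cut short (assumption).
        Thread.sleep(300);
      } catch (InterruptedException e) {
        interrupted.set(true);
      } finally {
        finished.countDown();
      }
    });

    started.await();                       // make sure the task is already running
    future.cancel(mayInterruptIfRunning);  // the call under discussion
    finished.await(5, TimeUnit.SECONDS);   // wait until the task body has ended
    executor.shutdown();                   // does not interrupt running tasks
    return interrupted.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("cancel(true)  interrupted the task: " + wasInterrupted(true));
    System.out.println("cancel(false) interrupted the task: " + wasInterrupted(false));
  }
}
```

In both cases the Future itself reports isCancelled() == true afterwards; the only difference is whether the thread that is still running the task gets interrupted, which is the behaviour I suspect matters here.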





> TestMmCompactorOnTez/TestCrudCompactorOnTez hangs when running against Tez 0.10.0 staging artifact
> --------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4230
>                 URL: https://issues.apache.org/jira/browse/TEZ-4230
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TestCrudCompactorOnTez.log, TestCrudCompactorOnTez2.log, jstack.log, org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez-output.txt
>
>
> Reproduced the issue in a ptest run that I set up to run against the Tez staging artifacts (https://repository.apache.org/content/repositories/orgapachetez-1068/):
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we confirm whether it's a Hive or a Tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> Tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97



--
This message was sent by Atlassian Jira
(v8.3.4#803005)