
Posted to user@tez.apache.org by Chris K Wensel <ch...@wensel.net> on 2014/09/04 05:05:20 UTC

orphaned DAGApp and TezChild

After running MiniTezCluster, I find a few DAGApp processes, and possibly a TezChild process, still hanging around when I call jps.

This is problematic on our CI servers (they start to add up), let alone my dinky laptop.

Is there a TezConfiguration setting I'm likely missing that would prevent these?

ckw

--
Chris K Wensel
chris@concurrentinc.com
http://concurrentinc.com

Re: orphaned DAGApp and TezChild

Posted by Chris K Wensel <ch...@wensel.net>.

Is there a way to make the mini cluster shutdown block until the AM has gone down? Or to just find the AM and push a shutdown to it?

ckw
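
A minimal sketch of the "block until the AM is gone" idea in test code, assuming you have some probe that reports whether the AM is still running. ShutdownGate, awaitStopped, and the stillRunning supplier are hypothetical names, not Tez or YARN APIs, and nothing in this thread confirms that Tez offers such a hook; the probe in main is only a timer standing in for a real check.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; not part of Tez or YARN. */
final class ShutdownGate {

    /**
     * Polls until stillRunning reports false or the timeout elapses.
     * Returns true if the process stopped in time, false on timeout or interrupt.
     */
    static boolean awaitStopped(BooleanSupplier stillRunning, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (stillRunning.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up waiting
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumption: a timer stands in for "is the AM still running?".
        long amExitsAt = System.currentTimeMillis() + 200;
        boolean stopped = awaitStopped(() -> System.currentTimeMillis() < amExitsAt, 5_000, 50);
        System.out.println(stopped
                ? "AM gone, safe to stop the mini cluster"
                : "timed out waiting for the AM");
    }
}
```

In a test teardown, the call to awaitStopped would sit between stopping the Tez client and shutting down the mini cluster, with a timeout so that a stuck AM cannot hang the build.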

On Sep 4, 2014, at 11:09 AM, Bikas Saha <bi...@hortonworks.com> wrote:

> At the end of the day, this is a race between the AM shutting down and the minicluster shutting down. If the minicluster's RM shuts down before the AM (because the test code called minicluster.shutdown), then the YARN client library that the AM uses to talk to YARN can end up waiting for the RM to come back up.
>  
> Bikas
>  

RE: orphaned DAGApp and TezChild

Posted by Bikas Saha <bi...@hortonworks.com>.

At the end of the day, this is a race between the AM shutting down and the
minicluster shutting down. If the minicluster's RM shuts down before the AM
(because the test code called minicluster.shutdown), then the YARN client
library that the AM uses to talk to YARN can end up waiting for the RM to
come back up.

Bikas

*From:* Siddharth Seth [mailto:sseth@apache.org]
*Sent:* Thursday, September 04, 2014 1:47 AM
*To:* user@tez.apache.org
*Subject:* Re: orphaned DAGApp and TezChild

This is a problem reported a while ago, I believe by Oleg.

The lock issue is inside YARN's AMRMClientAsync.

When a TezSession is shut down (tezClient.stop()), it sets up handlers
within the AM for a future shutdown, and returns.

After this, if the MiniCluster is shut down, there's a possibility that the
AM is still talking to the RM to schedule resources. Once the RM goes down,
this invocation goes into a retry loop while holding a lock that is also
required to unregister from the RM (and once that lock is obtained,
unregistering would enter another retry loop, since the RM is no longer
around).

Created TEZ-1541 to track this, and see what can be done by Tez to avoid
such situations.

Re: orphaned DAGApp and TezChild

Posted by Siddharth Seth <ss...@apache.org>.

This is a problem reported a while ago, I believe by Oleg.

The lock issue is inside YARN's AMRMClientAsync.

When a TezSession is shut down (tezClient.stop()), it sets up handlers
within the AM for a future shutdown, and returns.
After this, if the MiniCluster is shut down, there's a possibility that the
AM is still talking to the RM to schedule resources. Once the RM goes down,
this invocation goes into a retry loop while holding a lock that is also
required to unregister from the RM (and once that lock is obtained,
unregistering would enter another retry loop, since the RM is no longer
around).

Created TEZ-1541 to track this, and see what can be done by Tez to avoid
such situations.
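
The hang described above can be modelled with plain java.util.concurrent primitives. This is only an illustration of the pattern (a retry loop that holds a lock which the unregister path also needs), not the actual AMRMClientAsync code; LockedRetryDemo and every method in it are hypothetical, and the retry count and sleep times are arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Stdlib-only model of the hang; none of these names come from Tez or YARN. */
final class LockedRetryDemo {
    private final ReentrantLock clientLock = new ReentrantLock();
    private volatile boolean rmAvailable = false; // the modelled RM is already gone

    /** Models a scheduling call that keeps retrying while holding the client lock. */
    void allocateWithRetries(CountDownLatch lockHeld, int maxRetries) throws InterruptedException {
        clientLock.lock();
        try {
            lockHeld.countDown(); // signal that the lock is now held
            for (int i = 0; i < maxRetries && !rmAvailable; i++) {
                Thread.sleep(20); // back off, then try the absent RM again
            }
        } finally {
            clientLock.unlock();
        }
    }

    /** Models unregistering: it needs the same lock, so it waits behind the retries. */
    boolean tryUnregister(long timeoutMillis) {
        try {
            if (!clientLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // still blocked behind the retry loop
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        clientLock.unlock();
        return true;
    }

    /** Returns true if unregister could not proceed while the retry loop held the lock. */
    static boolean unregisterBlockedDuringRetries() {
        LockedRetryDemo demo = new LockedRetryDemo();
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread scheduler = new Thread(() -> {
            try {
                demo.allocateWithRetries(lockHeld, 50); // roughly one second of retries
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        scheduler.setDaemon(true);
        scheduler.start();
        try {
            lockHeld.await();
            boolean blocked = !demo.tryUnregister(100);
            scheduler.interrupt(); // end the retry loop early so the demo finishes
            scheduler.join();
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the demo", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unregister blocked during retries: " + unregisterBlockedDuringRetries());
    }
}
```

In the model the retry loop is bounded and interruptible, so the demo terminates; the situation described in the thread is worse because the retries continue for as long as the RM stays down.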

On Wed, Sep 3, 2014 at 8:44 PM, Chris K Wensel <ch...@wensel.net> wrote:

>
> this is confirmed on 0.5.0 (from apache release mvn repo)
>
> just caused a hang by running a single test; the TezChild did linger, but
> exited
>
> https://www.dropbox.com/s/86ryr1ka93xaiph/dagapp.threads.txt?dl=0
>
> ckw
>

Re: orphaned DAGApp and TezChild

Posted by Chris K Wensel <ch...@wensel.net>.

here are two orphaned TezChilds

https://www.dropbox.com/s/7ys2oopznhbcu3t/tezchild.threads.txt?dl=0
https://www.dropbox.com/s/qpkknk21wo2k8qb/tezchild2.threads.txt?dl=0

jps -m | grep TezChild
74763 TezChild 192.168.1.29 65336 container_1409846253566_0004_01_000002 application_1409846253566_0004 1
78633 TezChild 192.168.1.29 63480 container_1409854562430_0003_01_000002 application_1409854562430_0003 1
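
For CI cleanup, lines like these can be parsed to spot leftover processes. A small sketch, assuming the `jps -m` layout shown above (pid, main class, then arguments); JpsLines and its methods are hypothetical helpers, and the application id is located by its application_ prefix rather than by argument position, since the meaning of each TezChild argument is not stated in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for reading `jps -m` output; not part of Tez. */
final class JpsLines {

    /** One row of output: pid, main class, and the remaining arguments. */
    static final class Entry {
        final long pid;
        final String mainClass;
        final List<String> args;

        Entry(long pid, String mainClass, List<String> args) {
            this.pid = pid;
            this.mainClass = mainClass;
            this.args = args;
        }

        /** First argument that looks like a YARN application id, or null if none. */
        String applicationId() {
            for (String arg : args) {
                if (arg.startsWith("application_")) {
                    return arg;
                }
            }
            return null;
        }
    }

    /** Parses one line; returns null when it lacks a numeric pid and a main class. */
    static Entry parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            return null;
        }
        try {
            long pid = Long.parseLong(parts[0]);
            List<String> args = new ArrayList<>(Arrays.asList(parts).subList(2, parts.length));
            return new Entry(pid, parts[1], args);
        } catch (NumberFormatException e) {
            return null; // not a process line
        }
    }

    /** Keeps only the entries whose main class is in the given set. */
    static List<Entry> matching(List<String> lines, Set<String> mainClasses) {
        List<Entry> out = new ArrayList<>();
        for (String line : lines) {
            Entry e = parse(line);
            if (e != null && mainClasses.contains(e.mainClass)) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "74763 TezChild 192.168.1.29 65336 container_1409846253566_0004_01_000002 application_1409846253566_0004 1",
                "78633 TezChild 192.168.1.29 63480 container_1409854562430_0003_01_000002 application_1409854562430_0003 1");
        for (Entry e : matching(lines, Collections.singleton("TezChild"))) {
            System.out.println(e.pid + " " + e.applicationId());
        }
    }
}
```
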

my code is calling

"TezChild" daemon prio=5 tid=0x00007fafb3472800 nid=0x2b07 waiting on condition [0x000000010be8e000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007ee265f78> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
	at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:135)
	at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)

if the issue isn't obvious on your side, i'll try and figure out what DAG is responsible (i'm running lots of random tests, and just found two of these floating around).


On Sep 3, 2014, at 8:44 PM, Chris K Wensel <ch...@wensel.net> wrote:

> 
> this is confirmed on 0.5.0 (from apache release mvn repo)
> 
> just caused a hang by running a single test; the TezChild did linger, but exited
> 
> https://www.dropbox.com/s/86ryr1ka93xaiph/dagapp.threads.txt?dl=0
> 
> ckw
> 

Re: orphaned DAGApp and TezChild

Posted by Chris K Wensel <ch...@wensel.net>.

this is confirmed on 0.5.0 (from apache release mvn repo)

just caused a hang by running a single test; the TezChild did linger, but exited

https://www.dropbox.com/s/86ryr1ka93xaiph/dagapp.threads.txt?dl=0

ckw

On Sep 3, 2014, at 8:26 PM, Siddharth Seth <ss...@apache.org> wrote:

> Chris,
> Are you on the latest version of Tez (ideally the 0.5 release, which just went out today)? There was an issue with hanging DAGAppMasters, which was resolved recently.
> Otherwise, could you please include stack traces for the hung processes?
> 
> Thanks
> - Sid
> 
> 

Re: orphaned DAGApp and TezChild

Posted by Siddharth Seth <ss...@apache.org>.

Chris,
Are you on the latest version of Tez (ideally the 0.5 release, which just
went out today)? There was an issue with hanging DAGAppMasters, which was
resolved recently.
Otherwise, could you please include stack traces for the hung processes?

Thanks
- Sid

On Wed, Sep 3, 2014 at 8:05 PM, Chris K Wensel <ch...@wensel.net> wrote:

>
> After running MiniTezCluster, I find a few DAGApp processes, and possibly
> a TezChild process, still hanging around when I call jps.
>
> This is problematic on our CI servers (they start to add up), let alone
> my dinky laptop.
>
> Is there a TezConfiguration setting I'm likely missing that would prevent these?
>
> ckw
>
> --
> Chris K Wensel
> chris@concurrentinc.com
> http://concurrentinc.com
>
>