Posted to user@giraph.apache.org by Young Han <yo...@uwaterloo.ca> on 2014/03/17 15:56:15 UTC
Java Process Memory Leak
Hi all,
With Giraph 1.0.0, I've noticed an issue where the Java process
corresponding to the job loiters around indefinitely even after the job
completes (successfully). The process consumes memory but not CPU time.
This happens on both a single machine and clusters of machines (in which
case every worker has the issue). The only way I know of fixing this is
killing the Java process manually---restarting or stopping Hadoop does not
help.
Is this some known bug or a configuration issue on my end?
Thanks,
Young
Re: Java Process Memory Leak
Posted by Craig Muchinsky <cm...@us.ibm.com>.
Hi Young,
You are correct, I didn't catch that you were using 1.0.0 during my first
read. I submitted GIRAPH-871 for the netty 4 specific problem I found
against the 1.1.0-SNAPSHOT code.
Thanks,
Craig M.
From: Young Han <yo...@uwaterloo.ca>
To: user@giraph.apache.org
Date: 03/17/2014 05:36 PM
Subject: Re: Java Process Memory Leak
Interesting find... It looks like that bit was added recently
(https://reviews.apache.org/r/17644/diff/3/) and so was not part of Giraph
1.0.0, as far as I can tell.
Also, if anyone cares, a clunky (Ubuntu) workaround I'm using is:
kill $(ps aux | grep "[j]obcache/job_[0-9]\{12\}_[0-9]\{4\}/" | awk '{print $2}')
Thanks,
Young
Re: Java Process Memory Leak
Posted by Young Han <yo...@uwaterloo.ca>.
Interesting find... It looks like that bit was added recently
(https://reviews.apache.org/r/17644/diff/3/) and so was not part of Giraph
1.0.0, as far as I can tell.
Also, if anyone cares, a clunky (Ubuntu) workaround I'm using is:
kill $(ps aux | grep "[j]obcache/job_[0-9]\{12\}_[0-9]\{4\}/" | awk '{print $2}')
Thanks,
Young
On Mon, Mar 17, 2014 at 6:10 PM, Craig Muchinsky <cm...@us.ibm.com> wrote:
> I just noticed a similar problem myself. I did a thread dump and found
> similar netty client threads lingering. After poking around the source a
> bit, I'm wondering if the problem is related to this bit of code I found in
> the NettyClient.stop() method:
>
> workerGroup.shutdownGracefully();
> ProgressableUtils.awaitTerminationFuture(executionGroup, context);
> if (executionGroup != null) {
>   executionGroup.shutdownGracefully();
>   ProgressableUtils.awaitTerminationFuture(executionGroup, context);
> }
>
> Notice that the first await termination call seems to be waiting on the
> executionGroup instead of the workerGroup...
>
> Craig M.
Re: Java Process Memory Leak
Posted by Craig Muchinsky <cm...@us.ibm.com>.
I just noticed a similar problem myself. I did a thread dump and found
similar netty client threads lingering. After poking around the source a
bit, I'm wondering if the problem is related to this bit of code I found
in the NettyClient.stop() method:
workerGroup.shutdownGracefully();
ProgressableUtils.awaitTerminationFuture(executionGroup, context);
if (executionGroup != null) {
  executionGroup.shutdownGracefully();
  ProgressableUtils.awaitTerminationFuture(executionGroup, context);
}
Notice that the first await termination call seems to be waiting on the
executionGroup instead of the workerGroup...
Craig M.
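If that diagnosis is right, the fix would be to await the workerGroup in the
first call. Here's a minimal, self-contained sketch of the corrected shutdown
order; note it stands in plain java.util.concurrent pools for Netty's event
loop groups and inlines the awaiting (ProgressableUtils and the Giraph context
are left out), so the names are only an analogy to the snippet above, not the
actual patch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class NettyStopSketch {

    /**
     * Corrected shutdown order: await termination of the pool that was
     * just shut down. Returns true when both pools reached termination.
     */
    static boolean stop(ExecutorService workerGroup, ExecutorService executionGroup) {
        try {
            workerGroup.shutdown();
            // The snippet above awaited executionGroup at this point;
            // it should await workerGroup, the pool just shut down.
            workerGroup.awaitTermination(5, TimeUnit.SECONDS);
            if (executionGroup != null) {
                executionGroup.shutdown();
                executionGroup.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return workerGroup.isTerminated()
            && (executionGroup == null || executionGroup.isTerminated());
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newFixedThreadPool(2);
        ExecutorService exec = Executors.newFixedThreadPool(2);
        worker.submit(() -> { });
        exec.submit(() -> { });
        System.out.println(stop(worker, exec)); // prints "true"
    }
}
```

With the original ordering, workerGroup is never awaited, so its threads can
still be alive when stop() returns, which matches the lingering netty client
threads in the dumps.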
From: Young Han <yo...@uwaterloo.ca>
To: user@giraph.apache.org
Date: 03/17/2014 03:25 PM
Subject: Re: Java Process Memory Leak
Oh, I see. I did jstack on a cluster of machines and a single machine...
I'm not quite sure how to interpret the output. My best guess is that
there might be a deadlock---there's just a bunch of Netty threads waiting.
The links to the jstack dumps:
http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505
graph from SNAP)
http://pastebin.com/MNEUELui (MST, from one of the 64 workers, com-orkut
graph from SNAP)
Any idea what's happening? Or anything in particular I should look for
next?
Thanks,
Young
Re: Java Process Memory Leak
Posted by Young Han <yo...@uwaterloo.ca>.
Oh, I see. I did jstack on a cluster of machines and a single machine...
I'm not quite sure how to interpret the output. My best guess is that there
might be a deadlock---there's just a bunch of Netty threads waiting. The
links to the jstack dumps:
http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505 graph
from SNAP)
http://pastebin.com/MNEUELui (MST, from one of the 64 workers, com-orkut
graph from SNAP)
Any idea what's happening? Or anything in particular I should look for next?
Thanks,
Young
On Mon, Mar 17, 2014 at 12:19 PM, Avery Ching <ac...@apache.org> wrote:
> Hi Young,
>
> Our Hadoop instance (Corona) kills processes after they finish executing
> so we don't see this. You might want to do a jstack to see where it's hung
> up on and figure out the issue.
>
> Thanks
>
> Avery
>
Re: Java Process Memory Leak
Posted by Avery Ching <ac...@apache.org>.
Hi Young,
Our Hadoop instance (Corona) kills processes after they finish executing
so we don't see this. You might want to do a jstack to see where it's
hung up on and figure out the issue.
Thanks
Avery
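As a rough in-process alternative to jstack (my own sketch, not anything from
Giraph or Hadoop): the JDK's Thread.getAllStackTraces() exposes the same
thread names, states, and stacks, and live non-daemon threads, like the
lingering netty client threads in the dumps above, are exactly what keeps a
JVM from exiting after the job finishes:

```java
import java.util.Map;

public class ThreadDumpSketch {

    /**
     * Print name, state, and stack of each live non-daemon thread;
     * return how many there were. A count above the expected baseline
     * after shutdown points at the threads pinning the process.
     */
    static int dumpNonDaemonThreads() {
        int count = 0;
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            if (t.isAlive() && !t.isDaemon()) {
                count++;
                System.out.println(t.getName() + " (" + t.getState() + ")");
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // At minimum the main thread shows up, so this prints at least one entry.
        dumpNonDaemonThreads();
    }
}
```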
On 3/17/14, 7:56 AM, Young Han wrote:
> Hi all,
>
> With Giraph 1.0.0, I've noticed an issue where the Java process
> corresponding to the job loiters around indefinitely even after the
> job completes (successfully). The process consumes memory but not CPU
> time. This happens on both a single machine and clusters of machines
> (in which case every worker has the issue). The only way I know of
> fixing this is killing the Java process manually---restarting or
> stopping Hadoop does not help.
>
> Is this some known bug or a configuration issue on my end?
>
> Thanks,
> Young