Posted to dev@giraph.apache.org by Amani Alonazi <am...@kaust.edu.sa> on 2012/08/22 07:12:25 UTC

Giraph Job "Task attempt_* failed to report status" Problem

Hi all,

I'm running a minimum spanning tree compute function on a Hadoop cluster (20
machines). After a certain superstep (e.g. superstep 47 for a graph of
4,194,304 vertices and 181,566,970 edges), the execution time increased
dramatically. That is not the only problem: the job was killed with "Task
attempt_* failed to report status for 601 seconds. Killing!"

I disabled the checkpoint feature by setting "CHECKPOINT_FREQUENCY_DEFAULT = 0"
in GiraphJob.java. I don't need to write any data to disk, neither snapshots
nor output. I tested the algorithm on a sample graph of 7 vertices and it
works well.
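
As a side note, the checkpoint frequency can also be set per job through the
configuration instead of editing the source. A minimal sketch, assuming the
key behind CHECKPOINT_FREQUENCY_DEFAULT is "giraph.checkpointFrequency" (as in
the current code):

    import org.apache.hadoop.conf.Configuration;

    public class DisableCheckpoints {
      public static void configure(Configuration conf) {
        // 0 = never write a checkpoint, so no snapshots go to disk
        conf.setInt("giraph.checkpointFrequency", 0);
      }
    }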

Is there any way to profile or debug a Giraph job?
In the Giraph Stats, is the "Aggregate finished vertices" counter the number of
vertices that voted to halt? Also, is the "sent messages" counter per superstep
or the total number of messages?
If a vertex votes to halt, will it be reactivated upon receiving messages?

Thanks a lot!

Best,
Amani AlOnazi
MSc Computer Science
King Abdullah University of Science and Technology
Kingdom of Saudi Arabia


Re: Giraph Job "Task attempt_* failed to report status" Problem

Posted by Amani Alonazi <am...@kaust.edu.sa>.
Yes, I ran the minimum spanning tree job again and it failed again. I also
increased the ZooKeeper counter, and it still fails. The log files show that an
"org.apache.zookeeper.KeeperException$ConnectionLossException" occurred before
the job was killed. If it is a memory problem, can I increase the memory limit
for each worker?
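
(For context on the memory limit: each Giraph worker runs inside a Hadoop map
task JVM, so its heap is bounded by the child JVM options. A rough sketch,
assuming Hadoop 1.x property names and that the cluster allows per-job
overrides:)

    import org.apache.hadoop.conf.Configuration;

    public class WorkerHeap {
      // Sketch: give each map task (i.e. each Giraph worker) a 4 GB heap.
      // "mapred.child.java.opts" is the Hadoop 1.x property; per-slot memory
      // limits on the cluster may still cap what is usable.
      public static void configure(Configuration conf) {
        conf.set("mapred.child.java.opts", "-Xmx4096m");
      }
    }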

The PageRank benchmark stops after the maximum number of supersteps specified
by the user.

Thanks Vishal : )

Sent from my iPhone

On Aug 23, 2012, at 6:25 PM, Vishal Patel <wr...@gmail.com> wrote:

As I said, failures on specific supersteps *might* happen, but they don't
necessarily recur at the same superstep.

Did you run the minimum spanning tree job again? Did it finish
successfully?

On a different note, what do you mean by "submitted a job of 90
supersteps"? I don't think you can specify the number of supersteps -- that
number is determined by how many iterations it takes before all vertices
vote to halt. That's not something you can specify.



On Thu, Aug 23, 2012 at 7:58 AM, Amani Alonazi
<am...@kaust.edu.sa> wrote:

> Thank you Vishal.
>
> But I submitted a PageRank job of 90 supersteps, 20 workers, 4,000,000
> vertices and 30 edges per vertex. The job completed successfully. I'm
> really confused.
>
> On Wed, Aug 22, 2012 at 7:33 PM, Vishal Patel <wr...@gmail.com>wrote:
>
>> After several supersteps, sometimes a worker thread dies (say it ran out
>> of memory). Zookeeper waits for ~5 mins (600 seconds) and then decides that
>> the worker is not responsive and fails the entire job. At this point if you
>> have a checkpoint saved it will resume from there otherwise you have to
>> start from scratch.
>>
>> If you run the job again it should successfully finish (or it might error
>> at some other superstep / worker combination).
>>
>> Vishal
>>
>>
>>
>> On Tue, Aug 21, 2012 at 10:12 PM, Amani Alonazi <
>> amani.alonazi@kaust.edu.sa> wrote:
>>
>>> Hi all,
>>>
>>> I'm running a minimum spanning tree compute function on Hadoop cluster
>>> (20 machines). After certain supersteps (e.g. superstep 47 for a graph of
>>> 4,194,304 vertices and 181,566,970 edges), the execution time increased
>>> dramatically. This is not the only problem, the job has been killed "Task
>>> attempt_* failed to report status for 601 seconds. Killing! "
>>>
>>> I disabled the checkpoint feature by setting the
>>> "CHECKPOINT_FREQUENCY_DEFAULT = 0" in GiraphJob.java. I don't need to write
>>> any data to disk neither snapshots nor output. I tested the algorithm on
>>> sample graph of 7 vertices and it works well.
>>>
>>> Is there any way to profile or debug Giraph job?
>>> In the Giraph Stats the "Aggregate finished vertices" counter is it for
>>> the vertices which voted to halt? Also the "sent messages" counter, is it
>>> per each superstep or the total msgs?
>>> If a vertex vote to halt, will it be activated upon receiving messages?
>>>
>>> Thanks a lot!
>>>
>>> Best,
>>> Amani AlOnazi
>>> MSc Computer Science
>>> King Abdullah University of Science and Technology
>>> Kingdom of Saudi Arabia
>>>
>>
>>
>>
>
>
> --
> Amani AlOnazi
> MSc Computer Science
>  King Abdullah University of Science and Technology
> Kingdom of Saudi Arabia
> amani.alonazi@kaust.edu.sa | +966 (0) 555 191 795
>


Re: Giraph Job "Task attempt_* failed to report status" Problem

Posted by Vishal Patel <wr...@gmail.com>.
As I said, failures on specific supersteps *might* happen, but they don't
necessarily recur at the same superstep.

Did you run the minimum spanning tree job again? Did it finish
successfully?

On a different note, what do you mean by "submitted a job of 90
supersteps"? I don't think you can specify the number of supersteps -- that
number is determined by how many iterations it takes before all vertices
vote to halt. That's not something you can specify.



On Thu, Aug 23, 2012 at 7:58 AM, Amani Alonazi
<am...@kaust.edu.sa> wrote:

> Thank you Vishal.
>
> But I submitted a PageRank job of 90 supersteps, 20 workers, 4,000,000
> vertices and 30 edges per vertex. The job completed successfully. I'm
> really confused.
>
> On Wed, Aug 22, 2012 at 7:33 PM, Vishal Patel <wr...@gmail.com>wrote:
>
>> After several supersteps, sometimes a worker thread dies (say it ran out
>> of memory). Zookeeper waits for ~5 mins (600 seconds) and then decides that
>> the worker is not responsive and fails the entire job. At this point if you
>> have a checkpoint saved it will resume from there otherwise you have to
>> start from scratch.
>>
>> If you run the job again it should successfully finish (or it might error
>> at some other superstep / worker combination).
>>
>> Vishal
>>
>>
>>
>> On Tue, Aug 21, 2012 at 10:12 PM, Amani Alonazi <
>> amani.alonazi@kaust.edu.sa> wrote:
>>
>>> Hi all,
>>>
>>> I'm running a minimum spanning tree compute function on Hadoop cluster
>>> (20 machines). After certain supersteps (e.g. superstep 47 for a graph of
>>> 4,194,304 vertices and 181,566,970 edges), the execution time increased
>>> dramatically. This is not the only problem, the job has been killed "Task
>>> attempt_* failed to report status for 601 seconds. Killing! "
>>>
>>> I disabled the checkpoint feature by setting the
>>> "CHECKPOINT_FREQUENCY_DEFAULT = 0" in GiraphJob.java. I don't need to write
>>> any data to disk neither snapshots nor output. I tested the algorithm on
>>> sample graph of 7 vertices and it works well.
>>>
>>> Is there any way to profile or debug Giraph job?
>>> In the Giraph Stats the "Aggregate finished vertices" counter is it for
>>> the vertices which voted to halt? Also the "sent messages" counter, is it
>>> per each superstep or the total msgs?
>>> If a vertex vote to halt, will it be activated upon receiving messages?
>>>
>>> Thanks a lot!
>>>
>>> Best,
>>> Amani AlOnazi
>>> MSc Computer Science
>>> King Abdullah University of Science and Technology
>>> Kingdom of Saudi Arabia
>>>
>>
>>
>>
>
>
> --
> Amani AlOnazi
> MSc Computer Science
>  King Abdullah University of Science and Technology
> Kingdom of Saudi Arabia
> amani.alonazi@kaust.edu.sa | +966 (0) 555 191 795
>

Re: Giraph Job "Task attempt_* failed to report status" Problem

Posted by Amani Alonazi <am...@kaust.edu.sa>.
Thank you Vishal.

But I submitted a PageRank job with 90 supersteps, 20 workers, 4,000,000
vertices, and 30 edges per vertex. The job completed successfully. I'm
really confused.

On Wed, Aug 22, 2012 at 7:33 PM, Vishal Patel <wr...@gmail.com> wrote:

> After several supersteps, sometimes a worker thread dies (say it ran out
> of memory). Zookeeper waits for ~5 mins (600 seconds) and then decides that
> the worker is not responsive and fails the entire job. At this point if you
> have a checkpoint saved it will resume from there otherwise you have to
> start from scratch.
>
> If you run the job again it should successfully finish (or it might error
> at some other superstep / worker combination).
>
> Vishal
>
>
>
> On Tue, Aug 21, 2012 at 10:12 PM, Amani Alonazi <
> amani.alonazi@kaust.edu.sa> wrote:
>
>> Hi all,
>>
>> I'm running a minimum spanning tree compute function on Hadoop cluster
>> (20 machines). After certain supersteps (e.g. superstep 47 for a graph of
>> 4,194,304 vertices and 181,566,970 edges), the execution time increased
>> dramatically. This is not the only problem, the job has been killed "Task
>> attempt_* failed to report status for 601 seconds. Killing! "
>>
>> I disabled the checkpoint feature by setting the
>> "CHECKPOINT_FREQUENCY_DEFAULT = 0" in GiraphJob.java. I don't need to write
>> any data to disk neither snapshots nor output. I tested the algorithm on
>> sample graph of 7 vertices and it works well.
>>
>> Is there any way to profile or debug Giraph job?
>> In the Giraph Stats the "Aggregate finished vertices" counter is it for
>> the vertices which voted to halt? Also the "sent messages" counter, is it
>> per each superstep or the total msgs?
>> If a vertex vote to halt, will it be activated upon receiving messages?
>>
>> Thanks a lot!
>>
>> Best,
>> Amani AlOnazi
>> MSc Computer Science
>> King Abdullah University of Science and Technology
>> Kingdom of Saudi Arabia
>>
>
>
>


-- 
Amani AlOnazi
MSc Computer Science
King Abdullah University of Science and Technology
Kingdom of Saudi Arabia
amani.alonazi@kaust.edu.sa | +966 (0) 555 191 795


Re: Giraph Job "Task attempt_* failed to report status" Problem

Posted by Eli Reisman <in...@gmail.com>.
There are patches up to deal with this from inside the code. Giraph currently
has a strained relationship with the Hadoop progress mechanism. If you are able
to on your cluster, you can raise the MapReduce task timeout from the command
line (I can send you the specific commands if you want), but as of this week
one patch went in, and another is on the way (GIRAPH-274), to deal with the
most common moments in a job run where this occurs.
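
As a quick sketch (not the exact commands mentioned above, and assuming stock
Hadoop 1.x property names), the usual knob is mapred.task.timeout, in
milliseconds:

    import org.apache.hadoop.conf.Configuration;

    public class TaskTimeout {
      // Sketch: raise the "failed to report status" limit from the default
      // 600000 ms (the ~600/601 seconds in the kill message) to 30 minutes.
      public static void configure(Configuration conf) {
        conf.setInt("mapred.task.timeout", 30 * 60 * 1000);
      }
    }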

Sadly, I think Vishal might have called this one correctly. Given the
symptoms of your application failure, it sounds like a worker ran out of
memory and died during the computation. That causes the worker to stop
sending heartbeats to Hadoop, and the task eventually times out. ZooKeeper
has a default timeout as well, which is much less forgiving, but on the client
side it tends to keep operating when the Giraph worker ceases to function.
You might try using GIRAPH-232 or MemoryUtils to add some memory metrics to
your vertex and check them in the mapper detail logs on the Hadoop web UI for
the job.
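
A hand-rolled sketch of such memory metrics using plain java.lang.Runtime
(since the exact MemoryUtils/GIRAPH-232 helpers may differ in your checkout);
anything printed from compute() ends up in the task log, which is what the
mapper detail page shows:

    public class HeapStats {
      // Sketch: a one-line heap summary to log once per superstep per worker.
      public static String summary() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        return "heap: " + usedMb + " MB used of " + maxMb + " MB max";
      }
    }

For example, System.out.println(HeapStats.summary()) guarded so that only one
vertex per worker prints it each superstep.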

In your vertex implementation, try to reuse Writable value objects where
feasible, to cut down on constant instantiation/destruction every superstep as
well. Finally, there are also command-line options to increase the size of your
Netty RPC buffers. In fact, make sure your -Dgiraph.useNetty option is set to
true as well.
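
A generic sketch of the reuse pattern (the class and field names are
placeholders; it is worth double-checking that nothing in your Giraph version
holds a reference to the reused object across calls before adopting it):

    import org.apache.hadoop.io.DoubleWritable;

    public class ValueScratch {
      // Sketch: mutate one long-lived Writable in place instead of allocating
      // a fresh object every time the value changes within a superstep.
      private final DoubleWritable scratch = new DoubleWritable();

      public DoubleWritable wrap(double value) {
        scratch.set(value);   // no new allocation
        return scratch;
      }
    }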

Good luck, let us know how it goes,

Eli


On Wed, Aug 22, 2012 at 9:33 AM, Vishal Patel <wr...@gmail.com> wrote:

> After several supersteps, sometimes a worker thread dies (say it ran out of
> memory). Zookeeper waits for ~5 mins (600 seconds) and then decides that,
> the worker is not responsive and fails the entire job. At this point if you
> have a checkpoint saved it will resume from there otherwise you have to
> start from scratch.
>
> If you run the job again it should successfully finish (or it might error
> at some other superstep / worker combination).
>
> Vishal
>
>
>
> On Tue, Aug 21, 2012 at 10:12 PM, Amani Alonazi
> <am...@kaust.edu.sa>wrote:
>
> > Hi all,
> >
> > I'm running a minimum spanning tree compute function on Hadoop cluster
> (20
> > machines). After certain supersteps (e.g. superstep 47 for a graph of
> > 4,194,304 vertices and 181,566,970 edges), the execution time increased
> > dramatically. This is not the only problem, the job has been killed "Task
> > attempt_* failed to report status for 601 seconds. Killing! "
> >
> > I disabled the checkpoint feature by setting the
> > "CHECKPOINT_FREQUENCY_DEFAULT = 0" in GiraphJob.java. I don't need to
> write
> > any data to disk neither snapshots nor output. I tested the algorithm on
> > sample graph of 7 vertices and it works well.
> >
> > Is there any way to profile or debug Giraph job?
> > In the Giraph Stats the "Aggregate finished vertices" counter is it for
> > the vertices which voted to halt? Also the "sent messages" counter, is it
> > per each superstep or the total msgs?
> > If a vertex vote to halt, will it be activated upon receiving messages?
> >
> > Thanks a lot!
> >
> > Best,
> > Amani AlOnazi
> > MSc Computer Science
> > King Abdullah University of Science and Technology
> > Kingdom of Saudi Arabia
> >
>

Re: Giraph Job "Task attempt_* failed to report status" Problem

Posted by Vishal Patel <wr...@gmail.com>.
After several supersteps, sometimes a worker thread dies (say it ran out of
memory). ZooKeeper waits for ~10 minutes (600 seconds) and then decides that
the worker is not responsive and fails the entire job. At that point, if you
have a checkpoint saved, it will resume from there; otherwise you have to
start from scratch.
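
A sketch of turning checkpointing on so a long run can resume after such a
failure; the frequency key is the one behind CHECKPOINT_FREQUENCY_DEFAULT in
GiraphJob.java, while the directory key and path here are assumptions:

    import org.apache.hadoop.conf.Configuration;

    public class EnableCheckpoints {
      // Sketch: checkpoint every 5 supersteps so a crashed run restarts from
      // the last checkpoint instead of from scratch.
      public static void configure(Configuration conf) {
        conf.setInt("giraph.checkpointFrequency", 5);
        // hypothetical HDFS path; key name assumed from GiraphJob.java
        conf.set("giraph.checkpointDirectory", "/tmp/giraph-checkpoints");
      }
    }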

If you run the job again it should successfully finish (or it might error
at some other superstep / worker combination).

Vishal



On Tue, Aug 21, 2012 at 10:12 PM, Amani Alonazi
<am...@kaust.edu.sa> wrote:

> Hi all,
>
> I'm running a minimum spanning tree compute function on Hadoop cluster (20
> machines). After certain supersteps (e.g. superstep 47 for a graph of
> 4,194,304 vertices and 181,566,970 edges), the execution time increased
> dramatically. This is not the only problem, the job has been killed "Task
> attempt_* failed to report status for 601 seconds. Killing! "
>
> I disabled the checkpoint feature by setting the
> "CHECKPOINT_FREQUENCY_DEFAULT = 0" in GiraphJob.java. I don't need to write
> any data to disk neither snapshots nor output. I tested the algorithm on
> sample graph of 7 vertices and it works well.
>
> Is there any way to profile or debug Giraph job?
> In the Giraph Stats the "Aggregate finished vertices" counter is it for
> the vertices which voted to halt? Also the "sent messages" counter, is it
> per each superstep or the total msgs?
> If a vertex vote to halt, will it be activated upon receiving messages?
>
> Thanks a lot!
>
> Best,
> Amani AlOnazi
> MSc Computer Science
> King Abdullah University of Science and Technology
> Kingdom of Saudi Arabia
>
