Posted to user@pig.apache.org by Corbin Hoenes <co...@tynt.com> on 2010/05/11 20:06:19 UTC

failed to report status for 601 seconds

Not sure I am clear on how I can debug stuff on a cluster.  I currently have a long-running reducer that attempts to run 4 times before finally giving up.

I get 4 of these: Task attempt_201005101345_0052_r_000012_0 failed to report status for 601 seconds. Killing!

before it gives up...on the last try I noticed this in the log:  
ERROR: org.apache.hadoop.hdfs.DFSClient - Exception closing file /tmp/temp1925356068/tmp1003826561/_temporary/_attempt_201005101345_0052_r_000012_4/abs/tmp/temp1925356068/tmp-197182389/part-00012 : org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file /tmp/temp1925356068/tmp1003826561/_temporary/_attempt_201005101345_0052_r_000012_4/abs/tmp/temp1925356068/tmp-197182389/part-00012 by DFSClient_attempt_201005101345_0052_r_000012_4
	at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497)
	at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
How do I turn on log4j's DEBUG statements?  Hoping those will help me pinpoint what is going on here--maybe it's the cluster or maybe the script.
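For reference, one common way to get DEBUG output from Hadoop is to raise the log level in conf/log4j.properties on the nodes; this is only a sketch, and exact logger names and file locations vary by install:

```
# Sketch of conf/log4j.properties additions (loggers/paths may vary by install):
# raise all Hadoop classes to DEBUG
log4j.logger.org.apache.hadoop=DEBUG
# or target just the DFS client seen in the stack trace above
log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG
```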



Re: failed to report status for 601 seconds

Posted by Corbin Hoenes <co...@tynt.com>.
I think that diagnosis is correct, as we are no longer seeing those kinds of errors.  We reconfigured the cluster a bit.

I am trying:

pig -Dmapred.task.timeout=0 ./myscript.pig

but it doesn't accept properties set this way; it treats them as invalid and prints Pig's "usage" text at me.  I'm using Pig 0.6; is this still supported?
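One hedged workaround, assuming the stock bin/pig wrapper script (which passes PIG_OPTS through to the JVM): set the property in the environment instead of on pig's own command line.

```shell
# Hedged sketch: route the property through PIG_OPTS so the JVM sees it
# even if pig's argument parser rejects -D on its command line.
# (Assumes the standard bin/pig wrapper, which honors PIG_OPTS.)
export PIG_OPTS="-Dmapred.task.timeout=0"
echo "$PIG_OPTS"          # sanity-check what the JVM will receive
# pig ./myscript.pig      # then launch the script as usual
```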



Re: failed to report status for 601 seconds

Posted by Ashutosh Chauhan <as...@gmail.com>.
Yes, you can set job conf properties like that, but looking at your
original stack trace, it seems that won't solve your problem. It looks
like your cluster is in trouble. If you can, try restarting it. If
not, you need to dig into the tasktracker / datanode logs.

Ashutosh


Re: failed to report status for 601 seconds

Posted by Corbin Hoenes <co...@tynt.com>.
Zaki,

can pig take command lines like this to set job conf properties? 

pig -Dmapred.task.timeout=0



Re: failed to report status for 601 seconds

Posted by zaki rahaman <za...@gmail.com>.
Hi Corbin,

The timeout error you're seeing could also indicate that your reducer is
trying to process a very large key/group, which may be the reason for the
timeout in the first place. At least, this is behavior I've seen in the
past.
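If that's the suspicion, one quick way to check for skew is to count group sizes. A hypothetical Pig Latin sketch (relation and field names here are made up):

```
-- Hypothetical sketch: look for oversized groups that could stall one reducer.
grouped = GROUP records BY some_key;
sizes   = FOREACH grouped GENERATE group AS some_key, COUNT(records) AS n;
biggest = ORDER sizes BY n DESC;
top10   = LIMIT biggest 10;
DUMP top10;
```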



-- 
Zaki Rahaman

Re: failed to report status for 601 seconds

Posted by Corbin Hoenes <co...@tynt.com>.
Okay so what is the pig way to do this?

Noticed a lot of chatter about UDFs in Pig that don't call progress and can cause your jobs to get killed.  I am using only built-in UDFs like COUNT and FLATTEN; do they suffer from this same issue (no progress calls)?



Re: failed to report status for 601 seconds

Posted by Andrey Stepachev <oc...@gmail.com>.
You should report progress at intervals shorter than the configured timeout (in
your case 600 sec).
Add code like the snippet below to your reducer and call ping() wherever you
process tuples.

        // Initialize in the constructor (e.g. from the reducer's context).
        final TaskAttemptContext context = <init in constructor>;
        long lastTime = System.currentTimeMillis();

        // Call from the tuple-processing loop; reports progress at most
        // once every 10 seconds, so frequent calls stay cheap.
        public void ping() {
            final long currtime = System.currentTimeMillis();
            if (currtime - lastTime > 10000) {
                context.progress();  // tell the framework this task is alive
                lastTime = currtime;
            }
        }
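The same throttling logic can be separated from the MapReduce API so it is easy to exercise standalone. This is only a sketch of the idea, not exact reducer code; the Runnable stands in for context.progress():

```java
// Framework-free sketch of the throttled-progress idea above: in a real
// reducer the Runnable would be context::progress.
public class ProgressPinger {
    private final Runnable reportProgress; // stand-in for context.progress()
    private final long intervalMs;         // minimum gap between reports
    private long lastTime = System.currentTimeMillis();

    public ProgressPinger(Runnable reportProgress, long intervalMs) {
        this.reportProgress = reportProgress;
        this.intervalMs = intervalMs;
    }

    // Call from the tuple-processing loop; it only reports when the
    // interval has elapsed, so calling it per-tuple stays cheap.
    public void ping() {
        long now = System.currentTimeMillis();
        if (now - lastTime > intervalMs) {
            reportProgress.run();
            lastTime = now;
        }
    }
}
```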

