Posted to user@hbase.apache.org by steven zhuang <st...@gmail.com> on 2010/03/05 10:12:08 UTC

Task attempt failed to report status

hi, all,

     when I import data into the HTable with a Map/Reduce job, the job runs
smoothly until the last reducer fails 6 times to report its status.
     In my program I use a BatchUpdate to collect every 1000 cells and then
update the status. I don't think a normal insert should take 10 minutes,
because the first 99% of the job went smoothly; only the very last reducer
gets the "fail to report status" error.

     I suspect the problem is that the regionserver is way too busy, which
causes "output.collect(k, bu);" to take too much time to return, but I am
not sure because I don't know which regionserver is actually committing the
update.

    So which log should I dig into? Any hint is appreciated.

    The submitting portion of my code is as follows (just copied from some
online source and changed a little):

        public void reduce(ImmutableBytesWritable k,
                Iterator<HbaseMapWritable<byte[], byte[]>> v,
                OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                Reporter r) throws IOException {
            while (v.hasNext()) {
                r.setStatus("Reducer begin committing row: "
                        + new String(k.get(), HConstants.UTF8_ENCODING)
                        + "  Time:" + new Date());
                BatchUpdate bu = new BatchUpdate(k.get());
                int cellCnt = 0;
                while (v.hasNext()) {
                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
                    Iterator<Entry<byte[], byte[]>> iter = hmw.entrySet().iterator();
                    while (iter.hasNext()) {
                        Entry<byte[], byte[]> e = iter.next();
                        bu.put(e.getKey(), e.getValue());
                        // System.err.println("now add cell: " + e + " cell count: " + cellCnt + new Date());
                        if (++cellCnt > 1000) {
                            output.collect(k, bu);  // this line causes the timeout
                            r.setStatus("Reducer done committing "
                                    + new String(e.getKey(), HConstants.UTF8_ENCODING) + ":"
                                    + new String(e.getValue(), HConstants.UTF8_ENCODING)
                                    + "  Time:" + new Date());
                            bu = new BatchUpdate(k.get());
                            cellCnt = 0;
                        }
                    }
                }
                if (cellCnt > 0) {
                    output.collect(k, bu);
                }
            }
        }


the regionserver log is as follows:

2010-03-05 08:24:07,268 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.33333334% reduce > sort
2010-03-05 08:24:07,788 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6666667% reduce > reduce
2010-03-05 08:24:10,795 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.66668% Reducer done committing
queries:tempurpedic complaints:3:3:3:3  Time:Fri Mar 05 08:24:10 GMT 2010 >
reduce
2010-03-05 08:24:13,806 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:modeling agencies:1:1:1:1  Time:Fri Mar 05 08:24:13 GMT 2010 >
reduce
2010-03-05 08:24:16,810 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:polaris accessories:2:2:2:2  Time:Fri Mar 05 08:24:16 GMT 2010 >
reduce
2010-03-05 08:24:19,832 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:what was martin luther king jr s pants size:1:1:1:1  Time:Fri Mar 05
08:24:19 GMT 2010 > reduce
2010-03-05 08:24:22,838 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:nj npr stations:2:2:2:2  Time:Fri Mar 05 08:24:22 GMT 2010 > reduce
2010-03-05 08:24:25,842 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:stacked hair gallery:5:5:5:5  Time:Fri Mar 05 08:24:25 GMT 2010 >
reduce
2010-03-05 08:24:28,877 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:school bus endorsement test:2:2:2:2  Time:Fri Mar 05 08:24:28 GMT
2010 > reduce
2010-03-05 08:24:31,919 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:spectacular from pretty ricky pictures:2:2:2:2  Time:Fri Mar 05
08:24:31 GMT 2010 > reduce
2010-03-05 08:24:34,926 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
queries:sample tax return:2:2:2:2  Time:Fri Mar 05 08:24:34 GMT 2010 >
reduce
2010-03-05 08:24:37,929 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.7544105% Reducer begin committing
row: cp_2010/01/18  Time:Fri Mar 05 08:24:37 GMT 2010 > reduce
2010-03-05 08:24:40,933 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0 0.7544105% Reducer begin committing
row: cp_2010/01/18  Time:Fri Mar 05 08:24:37 GMT 2010 > reduce
2010-03-05 08:34:43,448 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002112137_0858_r_000066_0: Task
attempt_201002112137_0858_r_000066_0 failed to report status for 602
seconds. Killing!
2010-03-05 08:34:43,458 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task

Re: Task attempt failed to report status

Posted by Ted Yu <yu...@gmail.com>.
You can introduce a second thread in the reducer which periodically
reports status to Hadoop. At the same time, you can record the longest
put operation to see how much time it takes.

Lowering the number of cells in a Put to some value under 1000 may
help as well.
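
A minimal sketch of such a reporter thread, written against the old
org.apache.hadoop.mapred.Reporter API used elsewhere in this thread (the
class name, the 60-second interval and the longest-put bookkeeping are only
an illustration, not tested code):

    import java.util.Date;
    import org.apache.hadoop.mapred.Reporter;

    // Daemon thread that keeps calling setStatus() so the tasktracker sees
    // progress even while output.collect() is blocked in the reducer.
    public class StatusPinger extends Thread {
        private final Reporter reporter;
        private volatile boolean running = true;
        private volatile long longestPutMs = 0;

        public StatusPinger(Reporter reporter) {
            this.reporter = reporter;
            setDaemon(true);                  // don't keep the task JVM alive
        }

        // The reducer calls this with the measured time of each collect().
        public void recordPut(long elapsedMs) {
            if (elapsedMs > longestPutMs) longestPutMs = elapsedMs;
        }

        public void shutdown() { running = false; }

        public void run() {
            while (running) {
                reporter.setStatus("still committing, longest put so far: "
                        + longestPutMs + "ms  Time:" + new Date());
                try {
                    Thread.sleep(60 * 1000);  // report once a minute
                } catch (InterruptedException e) {
                    return;                   // stop quietly if interrupted
                }
            }
        }
    }

You would start one of these at the top of reduce(), time each
output.collect(k, bu) with System.currentTimeMillis() and pass the elapsed
time to recordPut(), and call shutdown() before the reducer returns.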

On Saturday, March 6, 2010, steven zhuang <st...@gmail.com> wrote:
> thanks stack.
>
> I think the timeout may caused by 1. HDFS is slow 2. the single row is way
> to big(millions of cells, around 50-100MB), but I don't know clearly how it
> happened.
>
> I have checked the regionserver log, there were lots of WARN message like:
> "2010-03-05 01:44:22,881 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
> Server handler 16 on 60020 took 1386ms appending an edit to hlog;
> editcount=223222"
> I have googled some, still not very clear, does this mean it takes 1086ms to
> write 223222 bytes to HDFS(disk), right? and this is really slow, right?
>
> Another error message is(output error when doing Put):
>
> 2010-03-05 08:19:58,162 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call put([B@112e6ef,
> [Lorg.apache.hadoop.hbase.client.Put;@1f77c71) from 192.168.8.228:35324:
> output error
> 2010-03-05 08:19:58,349 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>     at
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
> I saw Jean-Daniel
> <http://old.nabble.com/user/UserProfile.jtp?user=1325624> answered
> this in one thread, says that this is because the client gets killed while
> the region server is doing the action. I think in my case the error is
> raised when Reducer is killed(600 time-out) while the regionserver is still
> trying to commit data to HDFS.  pity I didn't find any line related to the
> task tracker on which the last reduce attempt timeouted.
>
> the timeout is 600 seconds, can anything block so long? I mean the
> disk-swap/write cannot take so long.
> what exactly happen when reducer do a Put:
> 1. sending data to regionserver,
> 2. region server update the data in memory,
> 3. region server flush the data to HDFS if some threshold is reached?
>
> there should be no splitting as in my case one row is really big that one
> region can only hold one row.
>
>
> comments answered in line:
>
> On Sun, Mar 7, 2010 at 6:27 AM, Stack <st...@duboce.net> wrote:
>
>> On Fri, Mar 5, 2010 at 1:12 AM, steven zhuang
>> <st...@gmail.com> wrote:
>> >     when I import data into the HTable with a Map/Reduce job, the task
>> runs
>> > smoothly until the last reducer failed 6 times to report its status.
>>
>> How many reducers?  All completed except this last one and it failed
>> inspite of 6 attempts?
>>
>
> the number of reducers is set by the following code:
>         "TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class,
> c);"
> there are 70 reducers, but the first 60 reducers finished very fast, took no
> more than 10 minutes. the last few reducers took longer, around 20 minutes
> except that the last one, took 6 retries and failed at the end.
>
>
>
>> Perhaps its a really fat Put that is holding things up?  Can you add
>> logging of put sizes or some such to see if it an anomalous record
>> that is causing the non-reporting after ten minutes?
>
>
> I don't know how much data in a Put can make it fat, but at present I put
> 1000 cells in one put, each column-id is a url and the cell value is less
> than 10 bytes, makes a Put contains about 50KB data.
>
>>     In my program I use batchupdate to collect for every 1000 cells, and
>
>> > update the status. I don't think the normal inserting will cost 10
>> minutes,
>> > coz the first 99% of the job was smoothly done, only the very last
>> reducer
>> > will get the "fail to report status" error.
>> >
>>
>> Can you add logging to your reducers?  Log each put?  Try and see
>> where its hanging for > 10 minutes?
>
>
> yeah, that's done, I output a line to stderr before the Put and after a Put,
> but this doesn't help much, I can only see the last
> line saying it's committing, but never get the done message. the log is
> something like the  following:
>
> Reducer is committing row:bm_2010/01/18 col:queries:xxxxx:1:1:1:1 Time:Fri
> Mar 05 08:24:37 GMT 2010 Reducer has done committing row:bm_2010/01/18
> col:queries:yyyyyyy:1:1:1:1 Time:Fri Mar 05 08:24:37 GMT 2010 Reducer is
> committing row:cp_2010/01/18 col:queries:uuuuuuuuuu:2:2:2:2 Time:Fri Mar 05
> 08:24:37 GMT 2010
> it hangs here until the MR task is killed.
>
>
>> >     I doubt the problem is caused by regoinserver is way too busy, which
>> > causes the "output.collect(k, bu);" takes too much time to return. but I
>> am
>> > not sure coz I don't know which regionserver is actually committing the
>> > update.
>> >
>> >    So which log should I dig into? any hint is appreciated.
>> >
>>
>> Well, a reducer is responsible for a portion of the rows only,
>> usually.  MR is sorting on row?  So what arrives at the reducer is
>> sorted?  When this last reducer is running, look at UI?  It's probably
>> going to one regionserver only?  If you emit whats being inserted,
>> perhaps you can see from the row what region its trying to go too...
>> See where its hosted and look at that regionservers logs?
>>
>>
> MR is sorting on row, and yes, I think for the last reducer there is only
> one row, one row is going to one regionserver, I have checked the GUI of
> Hbase master after the job is done(killed), the region holds the failed row
> doesn't appear on
> the page.
>
>>    My code's  submitting portion is as follows(just copied from some
>> online
>> > source and changed a little):
>> >
>> >        public void reduce(ImmutableBytesWritable k,
>> >                Iterator<HbaseMapWritable<byte[], byte[]>> v,
>> >                OutputCollector<ImmutableBytesWritable, BatchUpdate>
>> output,
>> >                Reporter r) throws IOException {
>> >            while (v.hasNext()) {
>> >                r.setStatus("Reducer begin committing row: " + new
>> > String(k.get(), HConstants.UTF8_ENCODING) + "  Time:"+ new Date());
>> >                BatchUpdate bu = new BatchUpdate(k.get());
>> >                int cellCnt = 0;
>> >                while (v.hasNext()) {
>> >                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
>> >                    Iterator<Entry<byte[], byte[]>> iter =
>> >  hmw.entrySet().iterator();
>> >                    while(iter.hasNext()){
>> >                        Entry<byte[], byte[]> e = iter.next();
>> >                        bu.put(e.getKey(), e.getValue());
>> >                        //System.err.println("now add cell: "+e+" cell
>> > count: " + cellCnt + new Date());
>> >                        if(++cellCnt>1000){
>> > *                            output.collect(k, bu); //this line coz
>> timeout.
>> > *
>> > *                            r.setStatus("Reducer done committing " + new
>> > String(e.getKey(), HConstants.UTF8_ENCODING) + ":"+new
>> String(e.getValue(),
>> > HConstants.UTF8_ENCODING)+ "  Time:"+ new Date());*
>> >                            bu = new BatchUpdate(k.get());
>> >                            cellCnt = 0;
>> >                        }
>> >                    }
>> >                }
>> >                if(cellCnt>0){
>> >                    output.collect(k, bu);
>> >                }
>> >            }
>> >
>>
>>
>> Try calling out.collect every ten cells?
>>
>> You are using TableOutputFormat?  Its buffering inserts to the table?
>> If so, configure it to not buffer so much?
>>
>
> I think the 1000 cells is really small now, previously I set it to 10000.
> yes, I am using the TableOutputFormat class, I think it buffers cause I
> didn't disable the buffering. all the reducer side configuration is done by
> the following line:
>         "TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class,
> c);"
>
>
> St.Ack
>>
>

Re: Task attempt failed to report status

Posted by steven zhuang <st...@gmail.com>.
Thanks, Stack.

I think the timeout may be caused by 1. HDFS being slow, or 2. the single
row being way too big (millions of cells, around 50-100MB), but I don't
clearly understand how it happened.

I have checked the regionserver log; there were lots of WARN messages like:
"2010-03-05 01:44:22,881 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
Server handler 16 on 60020 took 1386ms appending an edit to hlog;
editcount=223222"
I have googled a bit but it's still not very clear: does this mean it took
1386ms to write 223222 bytes to HDFS (disk)? And that is really slow, right?

Another error message (an output error while doing a Put) is:

2010-03-05 08:19:58,162 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call put([B@112e6ef,
[Lorg.apache.hadoop.hbase.client.Put;@1f77c71) from 192.168.8.228:35324:
output error
2010-03-05 08:19:58,349 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
    at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
    at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)

I saw Jean-Daniel
<http://old.nabble.com/user/UserProfile.jtp?user=1325624> answered
this in one thread, saying that it happens because the client gets killed
while the region server is doing the action. I think in my case the error is
raised when the reducer is killed (600-second timeout) while the regionserver
is still trying to commit data to HDFS. It's a pity I didn't find any related
line in the log of the tasktracker on which the last reduce attempt timed out.

The timeout is 600 seconds; can anything block that long? I mean, a disk
swap/write cannot take that long.
What exactly happens when the reducer does a Put:
1. the client sends the data to the regionserver,
2. the region server updates the data in memory,
3. the region server flushes the data to HDFS if some threshold is reached?

There should be no splitting, since in my case one row is so big that one
region can only hold one row.


Comments answered inline:

On Sun, Mar 7, 2010 at 6:27 AM, Stack <st...@duboce.net> wrote:

> On Fri, Mar 5, 2010 at 1:12 AM, steven zhuang
> <st...@gmail.com> wrote:
> >     when I import data into the HTable with a Map/Reduce job, the task
> runs
> > smoothly until the last reducer failed 6 times to report its status.
>
> How many reducers?  All completed except this last one and it failed
> inspite of 6 attempts?
>

The number of reducers is set by the following code:
        "TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class,
c);"
There are 70 reducers. The first 60 finished very fast, taking no more than
10 minutes; the last few took longer, around 20 minutes, except the last one,
which took 6 retries and failed in the end.



> Perhaps its a really fat Put that is holding things up?  Can you add
> logging of put sizes or some such to see if it an anomalous record
> that is causing the non-reporting after ten minutes?


I don't know how much data makes a Put fat, but at present I put 1000 cells
in one Put; each column qualifier is a URL and each cell value is less than
10 bytes, which makes a Put contain about 50KB of data.
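(A rough back-of-the-envelope check, assuming an average URL qualifier of
about 40 bytes: 1000 cells x (~40-byte qualifier + <10-byte value) is roughly
50KB of key/value payload per Put, before any per-cell overhead.)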

>     In my program I use batchupdate to collect for every 1000 cells, and

> > update the status. I don't think the normal inserting will cost 10
> minutes,
> > coz the first 99% of the job was smoothly done, only the very last
> reducer
> > will get the "fail to report status" error.
> >
>
> Can you add logging to your reducers?  Log each put?  Try and see
> where its hanging for > 10 minutes?


Yeah, that's done. I output a line to stderr before and after each Put, but
it doesn't help much: I can only see the last line saying it's committing,
and I never get the "done" message. The log is something like the following:

Reducer is committing row:bm_2010/01/18 col:queries:xxxxx:1:1:1:1 Time:Fri Mar 05 08:24:37 GMT 2010
Reducer has done committing row:bm_2010/01/18 col:queries:yyyyyyy:1:1:1:1 Time:Fri Mar 05 08:24:37 GMT 2010
Reducer is committing row:cp_2010/01/18 col:queries:uuuuuuuuuu:2:2:2:2 Time:Fri Mar 05 08:24:37 GMT 2010

It hangs there until the MR task is killed.


> >     I doubt the problem is caused by regoinserver is way too busy, which
> > causes the "output.collect(k, bu);" takes too much time to return. but I
> am
> > not sure coz I don't know which regionserver is actually committing the
> > update.
> >
> >    So which log should I dig into? any hint is appreciated.
> >
>
> Well, a reducer is responsible for a portion of the rows only,
> usually.  MR is sorting on row?  So what arrives at the reducer is
> sorted?  When this last reducer is running, look at UI?  It's probably
> going to one regionserver only?  If you emit whats being inserted,
> perhaps you can see from the row what region its trying to go too...
> See where its hosted and look at that regionservers logs?
>
>
MR is sorting on row, and yes, I think for the last reducer there is only
one row, so it all goes to one regionserver. I have checked the HBase master
GUI after the job was done (killed); the region holding the failed row
doesn't appear on the page.

>    My code's  submitting portion is as follows(just copied from some
> online
> > source and changed a little):
> >
> >        public void reduce(ImmutableBytesWritable k,
> >                Iterator<HbaseMapWritable<byte[], byte[]>> v,
> >                OutputCollector<ImmutableBytesWritable, BatchUpdate>
> output,
> >                Reporter r) throws IOException {
> >            while (v.hasNext()) {
> >                r.setStatus("Reducer begin committing row: " + new
> > String(k.get(), HConstants.UTF8_ENCODING) + "  Time:"+ new Date());
> >                BatchUpdate bu = new BatchUpdate(k.get());
> >                int cellCnt = 0;
> >                while (v.hasNext()) {
> >                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
> >                    Iterator<Entry<byte[], byte[]>> iter =
> >  hmw.entrySet().iterator();
> >                    while(iter.hasNext()){
> >                        Entry<byte[], byte[]> e = iter.next();
> >                        bu.put(e.getKey(), e.getValue());
> >                        //System.err.println("now add cell: "+e+" cell
> > count: " + cellCnt + new Date());
> >                        if(++cellCnt>1000){
> > *                            output.collect(k, bu); //this line coz
> timeout.
> > *
> > *                            r.setStatus("Reducer done committing " + new
> > String(e.getKey(), HConstants.UTF8_ENCODING) + ":"+new
> String(e.getValue(),
> > HConstants.UTF8_ENCODING)+ "  Time:"+ new Date());*
> >                            bu = new BatchUpdate(k.get());
> >                            cellCnt = 0;
> >                        }
> >                    }
> >                }
> >                if(cellCnt>0){
> >                    output.collect(k, bu);
> >                }
> >            }
> >
>
>
> Try calling out.collect every ten cells?
>
> You are using TableOutputFormat?  Its buffering inserts to the table?
> If so, configure it to not buffer so much?
>

I think 1000 cells is already really small; previously I set it to 10000.
Yes, I am using the TableOutputFormat class, and I think it buffers because
I didn't disable the buffering. All the reducer-side configuration is done by
the following line:
        "TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class,
c);"


St.Ack
>

Re: Task attempt failed to report status

Posted by Stack <st...@duboce.net>.
On Fri, Mar 5, 2010 at 1:12 AM, steven zhuang
<st...@gmail.com> wrote:
>     when I import data into the HTable with a Map/Reduce job, the task runs
> smoothly until the last reducer failed 6 times to report its status.

How many reducers?  All completed except this last one and it failed
in spite of 6 attempts?

Perhaps it's a really fat Put that is holding things up?  Can you add
logging of put sizes or some such to see if it is an anomalous record
that is causing the non-reporting after ten minutes?
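
Something like this in the inner loop of the reduce code quoted below, just
as a sketch (the byte counting is done by hand; declare a long byteCnt next
to cellCnt):

    while (iter.hasNext()) {
        Entry<byte[], byte[]> e = iter.next();
        bu.put(e.getKey(), e.getValue());
        byteCnt += e.getKey().length + e.getValue().length;  // track Put size
        if (++cellCnt > 1000) {
            System.err.println("collecting " + cellCnt + " cells, ~" + byteCnt
                    + " bytes, row: "
                    + new String(k.get(), HConstants.UTF8_ENCODING));
            long start = System.currentTimeMillis();
            output.collect(k, bu);                           // the call that stalls
            System.err.println("collect returned after "
                    + (System.currentTimeMillis() - start) + "ms");
            bu = new BatchUpdate(k.get());
            cellCnt = 0;
            byteCnt = 0;
        }
    }

A Put whose byte count is far larger than the others, or a collect that never
prints its "returned" line, would point at the anomalous record.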

>     In my program I use batchupdate to collect for every 1000 cells, and
> update the status. I don't think the normal inserting will cost 10 minutes,
> coz the first 99% of the job was smoothly done, only the very last reducer
> will get the "fail to report status" error.
>

Can you add logging to your reducers?  Log each put?  Try and see
where it's hanging for > 10 minutes?


>     I doubt the problem is caused by regoinserver is way too busy, which
> causes the "output.collect(k, bu);" takes too much time to return. but I am
> not sure coz I don't know which regionserver is actually committing the
> update.
>
>    So which log should I dig into? any hint is appreciated.
>

Well, a reducer is responsible for a portion of the rows only,
usually.  MR is sorting on row?  So what arrives at the reducer is
sorted?  When this last reducer is running, look at the UI?  It's probably
going to one regionserver only?  If you emit what's being inserted,
perhaps you can see from the row what region it's trying to go to...
See where it's hosted and look at that regionserver's logs?
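
If it helps, the client itself can tell you where a row lives; something like
this against the 0.20 client API (the table name is a placeholder and the row
is taken from the log above; check the method names against your version):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Ask the client which region (and so which regionserver) holds a row,
    // so you know which regionserver log to read.
    public class WhereIsRow {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "your_table");
            HRegionLocation loc =
                    table.getRegionLocation(Bytes.toBytes("cp_2010/01/18"));
            System.out.println("row lives in region "
                    + loc.getRegionInfo().getRegionNameAsString()
                    + " on server " + loc.getServerAddress().toString());
        }
    }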


>    My code's  submitting portion is as follows(just copied from some online
> source and changed a little):
>
>        public void reduce(ImmutableBytesWritable k,
>                Iterator<HbaseMapWritable<byte[], byte[]>> v,
>                OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
>                Reporter r) throws IOException {
>            while (v.hasNext()) {
>                r.setStatus("Reducer begin committing row: " + new
> String(k.get(), HConstants.UTF8_ENCODING) + "  Time:"+ new Date());
>                BatchUpdate bu = new BatchUpdate(k.get());
>                int cellCnt = 0;
>                while (v.hasNext()) {
>                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
>                    Iterator<Entry<byte[], byte[]>> iter =
>  hmw.entrySet().iterator();
>                    while(iter.hasNext()){
>                        Entry<byte[], byte[]> e = iter.next();
>                        bu.put(e.getKey(), e.getValue());
>                        //System.err.println("now add cell: "+e+" cell
> count: " + cellCnt + new Date());
>                        if(++cellCnt>1000){
> *                            output.collect(k, bu); //this line coz timeout.
> *
> *                            r.setStatus("Reducer done committing " + new
> String(e.getKey(), HConstants.UTF8_ENCODING) + ":"+new String(e.getValue(),
> HConstants.UTF8_ENCODING)+ "  Time:"+ new Date());*
>                            bu = new BatchUpdate(k.get());
>                            cellCnt = 0;
>                        }
>                    }
>                }
>                if(cellCnt>0){
>                    output.collect(k, bu);
>                }
>            }
>


Try calling out.collect every ten cells?

You are using TableOutputFormat?  It's buffering inserts to the table?
If so, configure it to not buffer so much?
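
If that turns out to be the case, one thing to try, assuming your HTable
reads the usual client write-buffer property (verify the property name
against your HBase version), is to shrink the write buffer in the job
configuration before calling initTableReduceJob:

    // c is the JobConf already passed to initTableReduceJob in your code.
    // A smaller buffer means smaller, more frequent flushes to the regionserver.
    c.setLong("hbase.client.write.buffer", 1024 * 1024);  // e.g. 1MB
    TableMapReduceUtil.initTableReduceJob(args[1], TableUploader.class, c);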

St.Ack

Re: Task attempt failed to report status

Posted by steven zhuang <st...@gmail.com>.
Thanks, J-D, again.

      For my table, some rows may have as many as several million cells, which
is why we chose HBase over an RDBMS.

      And sorry, I didn't notice that I pasted the tasktracker log in the
previous mail. I am checking the regionserver logs, but I don't know which
regionserver gets the update request, so the checking is going really slowly.

     In my understanding, data update and delete operations are all done in
memory first, so I really wonder how it can take 10 minutes to finish an
update even if the region is really big. I'm still checking the logs and will
update this thread if I find anything interesting.


On Sat, Mar 6, 2010 at 2:33 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> How many cells do you output normally per row? Looks like a lot the
> way you are doing it and it might cause problems.
>
> Also did you take a look at the region servers logs? Anything distress
> messages in there?
>
> J-D
>
> On Fri, Mar 5, 2010 at 1:12 AM, steven zhuang
> <st...@gmail.com> wrote:
> > hi, all,
> >
> >     when I import data into the HTable with a Map/Reduce job, the task
> runs
> > smoothly until the last reducer failed 6 times to report its status.
> >     In my program I use batchupdate to collect for every 1000 cells, and
> > update the status. I don't think the normal inserting will cost 10
> minutes,
> > coz the first 99% of the job was smoothly done, only the very last
> reducer
> > will get the "fail to report status" error.
> >
> >     I doubt the problem is caused by regoinserver is way too busy, which
> > causes the "output.collect(k, bu);" takes too much time to return. but I
> am
> > not sure coz I don't know which regionserver is actually committing the
> > update.
> >
> >    So which log should I dig into? any hint is appreciated.
> >
> >    My code's  submitting portion is as follows(just copied from some
> online
> > source and changed a little):
> >
> >        public void reduce(ImmutableBytesWritable k,
> >                Iterator<HbaseMapWritable<byte[], byte[]>> v,
> >                OutputCollector<ImmutableBytesWritable, BatchUpdate>
> output,
> >                Reporter r) throws IOException {
> >            while (v.hasNext()) {
> >                r.setStatus("Reducer begin committing row: " + new
> > String(k.get(), HConstants.UTF8_ENCODING) + "  Time:"+ new Date());
> >                BatchUpdate bu = new BatchUpdate(k.get());
> >                int cellCnt = 0;
> >                while (v.hasNext()) {
> >                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
> >                    Iterator<Entry<byte[], byte[]>> iter =
> >  hmw.entrySet().iterator();
> >                    while(iter.hasNext()){
> >                        Entry<byte[], byte[]> e = iter.next();
> >                        bu.put(e.getKey(), e.getValue());
> >                        //System.err.println("now add cell: "+e+" cell
> > count: " + cellCnt + new Date());
> >                        if(++cellCnt>1000){
> > *                            output.collect(k, bu); //this line coz
> timeout.
> > *
> > *                            r.setStatus("Reducer done committing " + new
> > String(e.getKey(), HConstants.UTF8_ENCODING) + ":"+new
> String(e.getValue(),
> > HConstants.UTF8_ENCODING)+ "  Time:"+ new Date());*
> >                            bu = new BatchUpdate(k.get());
> >                            cellCnt = 0;
> >                        }
> >                    }
> >                }
> >                if(cellCnt>0){
> >                    output.collect(k, bu);
> >                }
> >            }
> >
> >
> > the regionserver log is as follows:
> >
> > 2010-03-05 08:24:07,268 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.33333334% reduce > sort
> > 2010-03-05 08:24:07,788 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6666667% reduce > reduce
> > 2010-03-05 08:24:10,795 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.66668% Reducer done committing
> > queries:tempurpedic complaints:3:3:3:3  Time:Fri Mar 05 08:24:10 GMT 2010
> >
> > reduce
> > 2010-03-05 08:24:13,806 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:modeling agencies:1:1:1:1  Time:Fri Mar 05 08:24:13 GMT 2010 >
> > reduce
> > 2010-03-05 08:24:16,810 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:polaris accessories:2:2:2:2  Time:Fri Mar 05 08:24:16 GMT 2010 >
> > reduce
> > 2010-03-05 08:24:19,832 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:what was martin luther king jr s pants size:1:1:1:1  Time:Fri Mar
> 05
> > 08:24:19 GMT 2010 > reduce
> > 2010-03-05 08:24:22,838 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:nj npr stations:2:2:2:2  Time:Fri Mar 05 08:24:22 GMT 2010 >
> reduce
> > 2010-03-05 08:24:25,842 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:stacked hair gallery:5:5:5:5  Time:Fri Mar 05 08:24:25 GMT 2010 >
> > reduce
> > 2010-03-05 08:24:28,877 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:school bus endorsement test:2:2:2:2  Time:Fri Mar 05 08:24:28 GMT
> > 2010 > reduce
> > 2010-03-05 08:24:31,919 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:spectacular from pretty ricky pictures:2:2:2:2  Time:Fri Mar 05
> > 08:24:31 GMT 2010 > reduce
> > 2010-03-05 08:24:34,926 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> > queries:sample tax return:2:2:2:2  Time:Fri Mar 05 08:24:34 GMT 2010 >
> > reduce
> > 2010-03-05 08:24:37,929 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.7544105% Reducer begin committing
> > row: cp_2010/01/18  Time:Fri Mar 05 08:24:37 GMT 2010 > reduce
> > 2010-03-05 08:24:40,933 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0 0.7544105% Reducer begin committing
> > row: cp_2010/01/18  Time:Fri Mar 05 08:24:37 GMT 2010 > reduce
> > 2010-03-05 08:34:43,448 INFO org.apache.hadoop.mapred.TaskTracker:
> > attempt_201002112137_0858_r_000066_0: Task
> > attempt_201002112137_0858_r_000066_0 failed to report status for 602
> > seconds. Killing!
> > 2010-03-05 08:34:43,458 INFO org.apache.hadoop.mapred.TaskTracker:
> Process
> > Thread Dump: lost task
> >
>

Re: Task attempt failed to report status

Posted by Jean-Daniel Cryans <jd...@apache.org>.
How many cells do you normally output per row? It looks like a lot the
way you are doing it, and it might cause problems.

Also, did you take a look at the region servers' logs? Any distress
messages in there?

J-D

On Fri, Mar 5, 2010 at 1:12 AM, steven zhuang
<st...@gmail.com> wrote:
> hi, all,
>
>     when I import data into the HTable with a Map/Reduce job, the task runs
> smoothly until the last reducer failed 6 times to report its status.
>     In my program I use batchupdate to collect for every 1000 cells, and
> update the status. I don't think the normal inserting will cost 10 minutes,
> coz the first 99% of the job was smoothly done, only the very last reducer
> will get the "fail to report status" error.
>
>     I doubt the problem is caused by regoinserver is way too busy, which
> causes the "output.collect(k, bu);" takes too much time to return. but I am
> not sure coz I don't know which regionserver is actually committing the
> update.
>
>    So which log should I dig into? any hint is appreciated.
>
>    My code's  submitting portion is as follows(just copied from some online
> source and changed a little):
>
>        public void reduce(ImmutableBytesWritable k,
>                Iterator<HbaseMapWritable<byte[], byte[]>> v,
>                OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
>                Reporter r) throws IOException {
>            while (v.hasNext()) {
>                r.setStatus("Reducer begin committing row: " + new
> String(k.get(), HConstants.UTF8_ENCODING) + "  Time:"+ new Date());
>                BatchUpdate bu = new BatchUpdate(k.get());
>                int cellCnt = 0;
>                while (v.hasNext()) {
>                    HbaseMapWritable<byte[], byte[]> hmw = v.next();
>                    Iterator<Entry<byte[], byte[]>> iter =
>  hmw.entrySet().iterator();
>                    while(iter.hasNext()){
>                        Entry<byte[], byte[]> e = iter.next();
>                        bu.put(e.getKey(), e.getValue());
>                        //System.err.println("now add cell: "+e+" cell
> count: " + cellCnt + new Date());
>                        if(++cellCnt>1000){
> *                            output.collect(k, bu); //this line coz timeout.
> *
> *                            r.setStatus("Reducer done committing " + new
> String(e.getKey(), HConstants.UTF8_ENCODING) + ":"+new String(e.getValue(),
> HConstants.UTF8_ENCODING)+ "  Time:"+ new Date());*
>                            bu = new BatchUpdate(k.get());
>                            cellCnt = 0;
>                        }
>                    }
>                }
>                if(cellCnt>0){
>                    output.collect(k, bu);
>                }
>            }
>
>
> the regionserver log is as follows:
>
> 2010-03-05 08:24:07,268 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.33333334% reduce > sort
> 2010-03-05 08:24:07,788 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6666667% reduce > reduce
> 2010-03-05 08:24:10,795 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.66668% Reducer done committing
> queries:tempurpedic complaints:3:3:3:3  Time:Fri Mar 05 08:24:10 GMT 2010 >
> reduce
> 2010-03-05 08:24:13,806 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:modeling agencies:1:1:1:1  Time:Fri Mar 05 08:24:13 GMT 2010 >
> reduce
> 2010-03-05 08:24:16,810 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:polaris accessories:2:2:2:2  Time:Fri Mar 05 08:24:16 GMT 2010 >
> reduce
> 2010-03-05 08:24:19,832 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:what was martin luther king jr s pants size:1:1:1:1  Time:Fri Mar 05
> 08:24:19 GMT 2010 > reduce
> 2010-03-05 08:24:22,838 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:nj npr stations:2:2:2:2  Time:Fri Mar 05 08:24:22 GMT 2010 > reduce
> 2010-03-05 08:24:25,842 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:stacked hair gallery:5:5:5:5  Time:Fri Mar 05 08:24:25 GMT 2010 >
> reduce
> 2010-03-05 08:24:28,877 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:school bus endorsement test:2:2:2:2  Time:Fri Mar 05 08:24:28 GMT
> 2010 > reduce
> 2010-03-05 08:24:31,919 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:spectacular from pretty ricky pictures:2:2:2:2  Time:Fri Mar 05
> 08:24:31 GMT 2010 > reduce
> 2010-03-05 08:24:34,926 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.6714777% Reducer done committing
> queries:sample tax return:2:2:2:2  Time:Fri Mar 05 08:24:34 GMT 2010 >
> reduce
> 2010-03-05 08:24:37,929 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.7544105% Reducer begin committing
> row: cp_2010/01/18  Time:Fri Mar 05 08:24:37 GMT 2010 > reduce
> 2010-03-05 08:24:40,933 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0 0.7544105% Reducer begin committing
> row: cp_2010/01/18  Time:Fri Mar 05 08:24:37 GMT 2010 > reduce
> 2010-03-05 08:34:43,448 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201002112137_0858_r_000066_0: Task
> attempt_201002112137_0858_r_000066_0 failed to report status for 602
> seconds. Killing!
> 2010-03-05 08:34:43,458 INFO org.apache.hadoop.mapred.TaskTracker: Process
> Thread Dump: lost task
>