Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2012/01/18 23:05:32 UTC

I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

The map tasks fail timing out after 600 sec.
I am processing one 9 GB file with 16,000,000 records. Each record (think
of it as a line) generates hundreds of key-value pairs.
The job is unusual in that the output of the mapper, in terms of records or
bytes, is orders of magnitude larger than the input.
I have no idea what is slowing down the job except that the problem is in
the writes.

If I change the job so that only a fraction of the context.write statements
actually execute, the job succeeds.
Here are counters from one map task that failed and one that succeeded - I cannot
understand how a write can take so long, or what else the mapper might be doing.

JOB FAILED WITH TIMEOUT

*Parser*
  TotalProteins            90,103
  NumberFragments          10,933,089
*FileSystemCounters*
  HDFS_BYTES_READ          67,245,605
  FILE_BYTES_WRITTEN       444,054,807
*Map-Reduce Framework*
  Combine output records   10,033,499
  Map input records        90,103
  Spilled Records          10,032,836
  Map output bytes         3,520,182,794
  Combine input records    10,844,881
  Map output records       10,933,089
Same code but fewer writes
JOB SUCCEEDED

*Parser*
  TotalProteins            90,103
  NumberFragments          206,658,758
*FileSystemCounters*
  FILE_BYTES_READ          111,578,253
  HDFS_BYTES_READ          67,245,607
  FILE_BYTES_WRITTEN       220,169,922
*Map-Reduce Framework*
  Combine output records   4,046,128
  Map input records        90,103
  Spilled Records          4,046,128
  Map output bytes         662,354,413
  Combine input records    4,098,609
  Map output records       2,066,588
Any bright ideas?
-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Steve Lewis <lo...@gmail.com>.
In my hands the problem occurs in all map tasks. An associate with a
different cluster (mine has 8 nodes, his 40) reports that 80% of map tasks fail,
with a few succeeding.
I suspect some kind of I/O wait but fail to see how it gets to 600 sec.

On Wed, Jan 18, 2012 at 4:50 PM, Raj V <ra...@yahoo.com> wrote:

> Steve
>
> Does the timeout happen for all the map jobs? Are you using some kind of
> shared storage for map outputs? Any problems with the physical disks? If
> the shuffle phase has started could the disks be I/O waiting between the
> read and write?
>
> Raj
>
>
>
> >________________________________
> > From: Steve Lewis <lo...@gmail.com>
> >To: common-user@hadoop.apache.org
> >Sent: Wednesday, January 18, 2012 4:21 PM
> >Subject: Re: I am trying to run a large job and it is consistently
> failing with timeout - nothing happens for 600 sec
> >
> >1) I do a lot of progress reporting
> >2) Why would the job succeed when the only change in the code is
> >      if(NumberWrites++ % 100 == 0)
> >              context.write(key,value);
> >comment out the test  allowing full writes and the job fails
> >Since every write is a report I assume that something in the write code or
> >other hadoop code for dealing with output if failing. I do increment a
> >counter for every write or in the case of the above code potential write
> >What I am seeing is that where ever the timeout occurs it is not in a
> place
> >where I am capable of inserting more reporting
> >
> >
> >
> >On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lu...@mit.edu> wrote:
> >
> >> Perhaps you are not reporting progress throughout your task. If you
> >> happen to run a job large enough job you hit the the default timeout
> >> mapred.task.timeout  (that defaults to 10 min). Perhaps you should
> >> consider reporting progress in your mapper/reducer by calling
> >> progress() on the Reporter object. Check tip 7 of this link:
> >>
> >> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
> >>
> >> Hope that helps,
> >> -Leo
> >>
> >> Sent from my phone
> >>
> >> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lo...@gmail.com> wrote:
> >>
> >> > I KNOW is is a task timeout - what I do NOT know is WHY merely cutting
> >> the
> >> > number of writes causes it to go away. It seems to imply that some
> >> > context.write operation or something downstream from that is taking a
> >> huge
> >> > amount of time and that is all hadoop internal code - not mine so my
> >> > question is why should increasing the number and volume of wriotes
> cause
> >> a
> >> > task to time out
> >> >
> >> > On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com>
> wrote:
> >> >
> >> >> Sounds like mapred.task.timeout?  The default is 10 minutes.
> >> >>
> >> >> http://hadoop.apache.org/common/docs/current/mapred-default.html
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Tom
> >> >>
> >> >> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
> >> >> wrote:
> >> >>> The map tasks fail timing out after 600 sec.
> >> >>> I am processing one 9 GB file with 16,000,000 records. Each record
> >> (think
> >> >>> is it as a line)  generates hundreds of key value pairs.
> >> >>> The job is unusual in that the output of the mapper in terms of
> records
> >> >> or
> >> >>> bytes orders of magnitude larger than the input.
> >> >>> I have no idea what is slowing down the job except that the problem
> is
> >> in
> >> >>> the writes.
> >> >>>
> >> >>> If I change the job to merely bypass a fraction of the context.write
> >> >>> statements the job succeeds.
> >> >>> This is one map task that failed and one that succeeded - I cannot
> >> >>> understand how a write can take so long
> >> >>> or what else the mapper might be doing
> >> >>>
> >> >>> JOB FAILED WITH TIMEOUT
> >> >>>
> >> >>> *Parser*TotalProteins90,103NumberFragments10,933,089
> >> >>>
> >> >>
> >>
> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
> >> >>> *Map-Reduce Framework*Combine output records10,033,499Map input
> records
> >> >>> 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine
> >> input
> >> >>> records10,844,881Map output records10,933,089
> >> >>> Same code but fewer writes
> >> >>> JOB SUCCEEDED
> >> >>>
> >> >>> *Parser*TotalProteins90,103NumberFragments206,658,758
> >> >>>
> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
> >> >>> FILE_BYTES_WRITTEN220,169,922
> >> >>> *Map-Reduce Framework*Combine output records4,046,128Map input
> >> >>> records90,103Spilled
> >> >>> Records4,046,128Map output bytes662,354,413Combine input
> >> >> records4,098,609Map
> >> >>> output records2,066,588
> >> >>> Any bright ideas
> >> >>> --
> >> >>> Steven M. Lewis PhD
> >> >>> 4221 105th Ave NE
> >> >>> Kirkland, WA 98033
> >> >>> 206-384-1340 (cell)
> >> >>> Skype lordjoe_com
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Steven M. Lewis PhD
> >> > 4221 105th Ave NE
> >> > Kirkland, WA 98033
> >> > 206-384-1340 (cell)
> >> > Skype lordjoe_com
> >>
> >
> >
> >
> >--
> >Steven M. Lewis PhD
> >4221 105th Ave NE
> >Kirkland, WA 98033
> >206-384-1340 (cell)
> >Skype lordjoe_com
> >
> >
> >
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Raj V <ra...@yahoo.com>.
Steve

Does the timeout happen for all the map jobs? Are you using some kind of shared storage for map outputs? Any problems with the physical disks? If the shuffle phase has started could the disks be I/O waiting between the read and write?

Raj



>________________________________
> From: Steve Lewis <lo...@gmail.com>
>To: common-user@hadoop.apache.org 
>Sent: Wednesday, January 18, 2012 4:21 PM
>Subject: Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
> 
>1) I do a lot of progress reporting
>2) Why would the job succeed when the only change in the code is
>      if(NumberWrites++ % 100 == 0)
>              context.write(key,value);
>comment out the test  allowing full writes and the job fails
>Since every write is a report I assume that something in the write code or
>other hadoop code for dealing with output if failing. I do increment a
>counter for every write or in the case of the above code potential write
>What I am seeing is that where ever the timeout occurs it is not in a place
>where I am capable of inserting more reporting
>
>
>
>On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lu...@mit.edu> wrote:
>
>> Perhaps you are not reporting progress throughout your task. If you
>> happen to run a job large enough job you hit the the default timeout
>> mapred.task.timeout  (that defaults to 10 min). Perhaps you should
>> consider reporting progress in your mapper/reducer by calling
>> progress() on the Reporter object. Check tip 7 of this link:
>>
>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>>
>> Hope that helps,
>> -Leo
>>
>> Sent from my phone
>>
>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lo...@gmail.com> wrote:
>>
>> > I KNOW is is a task timeout - what I do NOT know is WHY merely cutting
>> the
>> > number of writes causes it to go away. It seems to imply that some
>> > context.write operation or something downstream from that is taking a
>> huge
>> > amount of time and that is all hadoop internal code - not mine so my
>> > question is why should increasing the number and volume of wriotes cause
>> a
>> > task to time out
>> >
>> > On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:
>> >
>> >> Sounds like mapred.task.timeout?  The default is 10 minutes.
>> >>
>> >> http://hadoop.apache.org/common/docs/current/mapred-default.html
>> >>
>> >> Thanks,
>> >>
>> >> Tom
>> >>
>> >> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
>> >> wrote:
>> >>> The map tasks fail timing out after 600 sec.
>> >>> I am processing one 9 GB file with 16,000,000 records. Each record
>> (think
>> >>> is it as a line)  generates hundreds of key value pairs.
>> >>> The job is unusual in that the output of the mapper in terms of records
>> >> or
>> >>> bytes orders of magnitude larger than the input.
>> >>> I have no idea what is slowing down the job except that the problem is
>> in
>> >>> the writes.
>> >>>
>> >>> If I change the job to merely bypass a fraction of the context.write
>> >>> statements the job succeeds.
>> >>> This is one map task that failed and one that succeeded - I cannot
>> >>> understand how a write can take so long
>> >>> or what else the mapper might be doing
>> >>>
>> >>> JOB FAILED WITH TIMEOUT
>> >>>
>> >>> *Parser*TotalProteins90,103NumberFragments10,933,089
>> >>>
>> >>
>> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
>> >>> *Map-Reduce Framework*Combine output records10,033,499Map input records
>> >>> 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine
>> input
>> >>> records10,844,881Map output records10,933,089
>> >>> Same code but fewer writes
>> >>> JOB SUCCEEDED
>> >>>
>> >>> *Parser*TotalProteins90,103NumberFragments206,658,758
>> >>> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
>> >>> FILE_BYTES_WRITTEN220,169,922
>> >>> *Map-Reduce Framework*Combine output records4,046,128Map input
>> >>> records90,103Spilled
>> >>> Records4,046,128Map output bytes662,354,413Combine input
>> >> records4,098,609Map
>> >>> output records2,066,588
>> >>> Any bright ideas
>> >>> --
>> >>> Steven M. Lewis PhD
>> >>> 4221 105th Ave NE
>> >>> Kirkland, WA 98033
>> >>> 206-384-1340 (cell)
>> >>> Skype lordjoe_com
>> >>
>> >
>> >
>> >
>> > --
>> > Steven M. Lewis PhD
>> > 4221 105th Ave NE
>> > Kirkland, WA 98033
>> > 206-384-1340 (cell)
>> > Skype lordjoe_com
>>
>
>
>
>-- 
>Steven M. Lewis PhD
>4221 105th Ave NE
>Kirkland, WA 98033
>206-384-1340 (cell)
>Skype lordjoe_com
>
>
>

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Michael Segel <mi...@hotmail.com>.
But Steve, it is your code... :-)

Here is a simple test...

Set your code up where the run fails...

Add a simple timer to see how long you spend in the Mapper.map() method.

Only print out the time if it's greater than, let's say, 500 seconds...

The other thing is to update a dynamic counter in Mapper.map().
This would force a status update to be sent to the JT.
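
A rough sketch of that instrumentation - assuming the new org.apache.hadoop.mapreduce
API; the counter names and the threshold are only examples, not from the original code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TimedMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Illustrative threshold: only report map() calls longer than ~500 seconds.
        private static final long THRESHOLD_MS = 500L * 1000L;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            long start = System.currentTimeMillis();

            // ... existing per-record parsing and context.write(...) calls ...

            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > THRESHOLD_MS) {
                System.err.println("Slow map() call: " + elapsed + " ms");
            }
            // Incrementing a counter sends a status update back to the JobTracker.
            context.getCounter("Debug", "MapCalls").increment(1);
        }
    }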

Also, you don't give a lot of detail...
Are you writing out to an HBase table???

HTH

-Mike

On Jan 18, 2012, at 6:21 PM, Steve Lewis wrote:

> 1) I do a lot of progress reporting
> 2) Why would the job succeed when the only change in the code is
>      if(NumberWrites++ % 100 == 0)
>              context.write(key,value);
> comment out the test  allowing full writes and the job fails
> Since every write is a report I assume that something in the write code or
> other hadoop code for dealing with output if failing. I do increment a
> counter for every write or in the case of the above code potential write
> What I am seeing is that where ever the timeout occurs it is not in a place
> where I am capable of inserting more reporting
> 
> 
> 
> On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lu...@mit.edu> wrote:
> 
>> Perhaps you are not reporting progress throughout your task. If you
>> happen to run a job large enough job you hit the the default timeout
>> mapred.task.timeout  (that defaults to 10 min). Perhaps you should
>> consider reporting progress in your mapper/reducer by calling
>> progress() on the Reporter object. Check tip 7 of this link:
>> 
>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>> 
>> Hope that helps,
>> -Leo
>> 
>> Sent from my phone
>> 
>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lo...@gmail.com> wrote:
>> 
>>> I KNOW is is a task timeout - what I do NOT know is WHY merely cutting
>> the
>>> number of writes causes it to go away. It seems to imply that some
>>> context.write operation or something downstream from that is taking a
>> huge
>>> amount of time and that is all hadoop internal code - not mine so my
>>> question is why should increasing the number and volume of wriotes cause
>> a
>>> task to time out
>>> 
>>> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:
>>> 
>>>> Sounds like mapred.task.timeout?  The default is 10 minutes.
>>>> 
>>>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>>> 
>>>> Thanks,
>>>> 
>>>> Tom
>>>> 
>>>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
>>>> wrote:
>>>>> The map tasks fail timing out after 600 sec.
>>>>> I am processing one 9 GB file with 16,000,000 records. Each record
>> (think
>>>>> is it as a line)  generates hundreds of key value pairs.
>>>>> The job is unusual in that the output of the mapper in terms of records
>>>> or
>>>>> bytes orders of magnitude larger than the input.
>>>>> I have no idea what is slowing down the job except that the problem is
>> in
>>>>> the writes.
>>>>> 
>>>>> If I change the job to merely bypass a fraction of the context.write
>>>>> statements the job succeeds.
>>>>> This is one map task that failed and one that succeeded - I cannot
>>>>> understand how a write can take so long
>>>>> or what else the mapper might be doing
>>>>> 
>>>>> JOB FAILED WITH TIMEOUT
>>>>> 
>>>>> *Parser*TotalProteins90,103NumberFragments10,933,089
>>>>> 
>>>> 
>> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
>>>>> *Map-Reduce Framework*Combine output records10,033,499Map input records
>>>>> 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine
>> input
>>>>> records10,844,881Map output records10,933,089
>>>>> Same code but fewer writes
>>>>> JOB SUCCEEDED
>>>>> 
>>>>> *Parser*TotalProteins90,103NumberFragments206,658,758
>>>>> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
>>>>> FILE_BYTES_WRITTEN220,169,922
>>>>> *Map-Reduce Framework*Combine output records4,046,128Map input
>>>>> records90,103Spilled
>>>>> Records4,046,128Map output bytes662,354,413Combine input
>>>> records4,098,609Map
>>>>> output records2,066,588
>>>>> Any bright ideas
>>>>> --
>>>>> Steven M. Lewis PhD
>>>>> 4221 105th Ave NE
>>>>> Kirkland, WA 98033
>>>>> 206-384-1340 (cell)
>>>>> Skype lordjoe_com
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Steven M. Lewis PhD
>>> 4221 105th Ave NE
>>> Kirkland, WA 98033
>>> 206-384-1340 (cell)
>>> Skype lordjoe_com
>> 
> 
> 
> 
> -- 
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com


Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Steve Lewis <lo...@gmail.com>.
1) I do a lot of progress reporting
2) Why would the job succeed when the only change in the code is
      if(NumberWrites++ % 100 == 0)
              context.write(key,value);
Comment out the test, allowing full writes, and the job fails.
Since every write is a report, I assume that something in the write code or
other Hadoop code for dealing with output is failing. I do increment a
counter for every write - or, in the case of the above code, every potential write.
What I am seeing is that wherever the timeout occurs, it is not in a place
where I am capable of inserting more reporting.
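
For reference, a self-contained sketch of the modified mapper - the class name,
key/value types and counter name are invented for illustration; only the
NumberWrites test is from the actual code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ThrottledWriteMapper extends Mapper<LongWritable, Text, Text, Text> {
        private long numberWrites = 0;

        // Every potential write bumps a counter (which also reports progress);
        // only 1 write in 100 is actually emitted while the test is in place.
        private void writeFragment(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.getCounter("Parser", "PotentialWrites").increment(1);
            if (numberWrites++ % 100 == 0) {   // remove this test to restore full output
                context.write(key, value);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... parse the record and call writeFragment(...) hundreds of times ...
        }
    }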



On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lu...@mit.edu> wrote:

> Perhaps you are not reporting progress throughout your task. If you
> happen to run a job large enough job you hit the the default timeout
> mapred.task.timeout  (that defaults to 10 min). Perhaps you should
> consider reporting progress in your mapper/reducer by calling
> progress() on the Reporter object. Check tip 7 of this link:
>
> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>
> Hope that helps,
> -Leo
>
> Sent from my phone
>
> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lo...@gmail.com> wrote:
>
> > I KNOW is is a task timeout - what I do NOT know is WHY merely cutting
> the
> > number of writes causes it to go away. It seems to imply that some
> > context.write operation or something downstream from that is taking a
> huge
> > amount of time and that is all hadoop internal code - not mine so my
> > question is why should increasing the number and volume of wriotes cause
> a
> > task to time out
> >
> > On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:
> >
> >> Sounds like mapred.task.timeout?  The default is 10 minutes.
> >>
> >> http://hadoop.apache.org/common/docs/current/mapred-default.html
> >>
> >> Thanks,
> >>
> >> Tom
> >>
> >> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
> >> wrote:
> >>> The map tasks fail timing out after 600 sec.
> >>> I am processing one 9 GB file with 16,000,000 records. Each record
> (think
> >>> is it as a line)  generates hundreds of key value pairs.
> >>> The job is unusual in that the output of the mapper in terms of records
> >> or
> >>> bytes orders of magnitude larger than the input.
> >>> I have no idea what is slowing down the job except that the problem is
> in
> >>> the writes.
> >>>
> >>> If I change the job to merely bypass a fraction of the context.write
> >>> statements the job succeeds.
> >>> This is one map task that failed and one that succeeded - I cannot
> >>> understand how a write can take so long
> >>> or what else the mapper might be doing
> >>>
> >>> JOB FAILED WITH TIMEOUT
> >>>
> >>> *Parser*TotalProteins90,103NumberFragments10,933,089
> >>>
> >>
> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
> >>> *Map-Reduce Framework*Combine output records10,033,499Map input records
> >>> 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine
> input
> >>> records10,844,881Map output records10,933,089
> >>> Same code but fewer writes
> >>> JOB SUCCEEDED
> >>>
> >>> *Parser*TotalProteins90,103NumberFragments206,658,758
> >>> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
> >>> FILE_BYTES_WRITTEN220,169,922
> >>> *Map-Reduce Framework*Combine output records4,046,128Map input
> >>> records90,103Spilled
> >>> Records4,046,128Map output bytes662,354,413Combine input
> >> records4,098,609Map
> >>> output records2,066,588
> >>> Any bright ideas
> >>> --
> >>> Steven M. Lewis PhD
> >>> 4221 105th Ave NE
> >>> Kirkland, WA 98033
> >>> 206-384-1340 (cell)
> >>> Skype lordjoe_com
> >>
> >
> >
> >
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Leonardo Urbina <lu...@mit.edu>.
Perhaps you are not reporting progress throughout your task. If you
happen to run a large enough job, you hit the default timeout,
mapred.task.timeout (which defaults to 10 min). Perhaps you should
consider reporting progress in your mapper/reducer by calling
progress() on the Reporter object. Check tip 7 of this link:

http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
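
A minimal sketch of what that might look like - with the new
org.apache.hadoop.mapreduce API the Context object plays the Reporter role;
with the old mapred API you would call reporter.progress() instead. The
10,000-record interval is arbitrary:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ProgressReportingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private long records = 0;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... per-record work and context.write(...) calls ...

            if (++records % 10000 == 0) {
                context.setStatus("Processed " + records + " records");
                context.progress();   // tell the framework the task is still alive
            }
        }
    }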

Hope that helps,
-Leo

Sent from my phone

On Jan 18, 2012, at 6:46 PM, Steve Lewis <lo...@gmail.com> wrote:

> I KNOW is is a task timeout - what I do NOT know is WHY merely cutting the
> number of writes causes it to go away. It seems to imply that some
> context.write operation or something downstream from that is taking a huge
> amount of time and that is all hadoop internal code - not mine so my
> question is why should increasing the number and volume of wriotes cause a
> task to time out
>
> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:
>
>> Sounds like mapred.task.timeout?  The default is 10 minutes.
>>
>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>
>> Thanks,
>>
>> Tom
>>
>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
>> wrote:
>>> The map tasks fail timing out after 600 sec.
>>> I am processing one 9 GB file with 16,000,000 records. Each record (think
>>> is it as a line)  generates hundreds of key value pairs.
>>> The job is unusual in that the output of the mapper in terms of records
>> or
>>> bytes orders of magnitude larger than the input.
>>> I have no idea what is slowing down the job except that the problem is in
>>> the writes.
>>>
>>> If I change the job to merely bypass a fraction of the context.write
>>> statements the job succeeds.
>>> This is one map task that failed and one that succeeded - I cannot
>>> understand how a write can take so long
>>> or what else the mapper might be doing
>>>
>>> JOB FAILED WITH TIMEOUT
>>>
>>> *Parser*TotalProteins90,103NumberFragments10,933,089
>>>
>> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
>>> *Map-Reduce Framework*Combine output records10,033,499Map input records
>>> 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine input
>>> records10,844,881Map output records10,933,089
>>> Same code but fewer writes
>>> JOB SUCCEEDED
>>>
>>> *Parser*TotalProteins90,103NumberFragments206,658,758
>>> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
>>> FILE_BYTES_WRITTEN220,169,922
>>> *Map-Reduce Framework*Combine output records4,046,128Map input
>>> records90,103Spilled
>>> Records4,046,128Map output bytes662,354,413Combine input
>> records4,098,609Map
>>> output records2,066,588
>>> Any bright ideas
>>> --
>>> Steven M. Lewis PhD
>>> 4221 105th Ave NE
>>> Kirkland, WA 98033
>>> 206-384-1340 (cell)
>>> Skype lordjoe_com
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Steve Lewis <lo...@gmail.com>.
It always fails with a task timeout, and that error gives me very little
indication of where the error occurs. The one piece of data I have is that
if I only call context.write 1 time in 100, it does not time out, suggesting
that it is not MY code that is timing out.
I could try to time the write statements and see if they get slow, although
those might point to something slow in another thread?? Or it might be in the
internal Hadoop data-handling code.
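
One way to time them would be a small helper inside the mapper class - a sketch
only, assuming the new org.apache.hadoop.mapreduce API; the 1-second threshold
and the Debug counter are arbitrary:

    private void timedWrite(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        context.write(key, value);
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > 1000) {   // log any write slower than one second
            System.err.println("Slow context.write: " + elapsed + " ms");
            context.getCounter("Debug", "SlowWrites").increment(1);
        }
    }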

On Wed, Jan 18, 2012 at 3:51 PM, Alex Kozlov <al...@cloudera.com> wrote:

> Does it always fail at the same place?  Does the task log shows something
> unusual?
>
> On Wed, Jan 18, 2012 at 3:46 PM, Steve Lewis <lo...@gmail.com>
> wrote:
>
> > I KNOW is is a task timeout - what I do NOT know is WHY merely cutting
> the
> > number of writes causes it to go away. It seems to imply that some
> > context.write operation or something downstream from that is taking a
> huge
> > amount of time and that is all hadoop internal code - not mine so my
> > question is why should increasing the number and volume of wriotes cause
> a
> > task to time out
> >
> > On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:
> >
> > > Sounds like mapred.task.timeout?  The default is 10 minutes.
> > >
> > > http://hadoop.apache.org/common/docs/current/mapred-default.html
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> > > On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
> > > wrote:
> > > > The map tasks fail timing out after 600 sec.
> > > > I am processing one 9 GB file with 16,000,000 records. Each record
> > (think
> > > > is it as a line)  generates hundreds of key value pairs.
> > > > The job is unusual in that the output of the mapper in terms of
> records
> > > or
> > > > bytes orders of magnitude larger than the input.
> > > > I have no idea what is slowing down the job except that the problem
> is
> > in
> > > > the writes.
> > > >
> > > > If I change the job to merely bypass a fraction of the context.write
> > > > statements the job succeeds.
> > > > This is one map task that failed and one that succeeded - I cannot
> > > > understand how a write can take so long
> > > > or what else the mapper might be doing
> > > >
> > > > JOB FAILED WITH TIMEOUT
> > > >
> > > > *Parser*TotalProteins90,103NumberFragments10,933,089
> > > >
> > >
> >
> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
> > > > *Map-Reduce Framework*Combine output records10,033,499Map input
> records
> > > > 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine
> > input
> > > > records10,844,881Map output records10,933,089
> > > > Same code but fewer writes
> > > > JOB SUCCEEDED
> > > >
> > > > *Parser*TotalProteins90,103NumberFragments206,658,758
> > > >
> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
> > > > FILE_BYTES_WRITTEN220,169,922
> > > > *Map-Reduce Framework*Combine output records4,046,128Map input
> > > > records90,103Spilled
> > > > Records4,046,128Map output bytes662,354,413Combine input
> > > records4,098,609Map
> > > > output records2,066,588
> > > > Any bright ideas
> > > > --
> > > > Steven M. Lewis PhD
> > > > 4221 105th Ave NE
> > > > Kirkland, WA 98033
> > > > 206-384-1340 (cell)
> > > > Skype lordjoe_com
> > >
> >
> >
> >
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
> >
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Alex Kozlov <al...@cloudera.com>.
Does it always fail at the same place? Does the task log show something
unusual?

On Wed, Jan 18, 2012 at 3:46 PM, Steve Lewis <lo...@gmail.com> wrote:

> I KNOW is is a task timeout - what I do NOT know is WHY merely cutting the
> number of writes causes it to go away. It seems to imply that some
> context.write operation or something downstream from that is taking a huge
> amount of time and that is all hadoop internal code - not mine so my
> question is why should increasing the number and volume of wriotes cause a
> task to time out
>
> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:
>
> > Sounds like mapred.task.timeout?  The default is 10 minutes.
> >
> > http://hadoop.apache.org/common/docs/current/mapred-default.html
> >
> > Thanks,
> >
> > Tom
> >
> > On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
> > wrote:
> > > The map tasks fail timing out after 600 sec.
> > > I am processing one 9 GB file with 16,000,000 records. Each record
> (think
> > > is it as a line)  generates hundreds of key value pairs.
> > > The job is unusual in that the output of the mapper in terms of records
> > or
> > > bytes orders of magnitude larger than the input.
> > > I have no idea what is slowing down the job except that the problem is
> in
> > > the writes.
> > >
> > > If I change the job to merely bypass a fraction of the context.write
> > > statements the job succeeds.
> > > This is one map task that failed and one that succeeded - I cannot
> > > understand how a write can take so long
> > > or what else the mapper might be doing
> > >
> > > JOB FAILED WITH TIMEOUT
> > >
> > > *Parser*TotalProteins90,103NumberFragments10,933,089
> > >
> >
> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
> > > *Map-Reduce Framework*Combine output records10,033,499Map input records
> > > 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine
> input
> > > records10,844,881Map output records10,933,089
> > > Same code but fewer writes
> > > JOB SUCCEEDED
> > >
> > > *Parser*TotalProteins90,103NumberFragments206,658,758
> > > *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
> > > FILE_BYTES_WRITTEN220,169,922
> > > *Map-Reduce Framework*Combine output records4,046,128Map input
> > > records90,103Spilled
> > > Records4,046,128Map output bytes662,354,413Combine input
> > records4,098,609Map
> > > output records2,066,588
> > > Any bright ideas
> > > --
> > > Steven M. Lewis PhD
> > > 4221 105th Ave NE
> > > Kirkland, WA 98033
> > > 206-384-1340 (cell)
> > > Skype lordjoe_com
> >
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Steve Lewis <lo...@gmail.com>.
I KNOW it is a task timeout - what I do NOT know is WHY merely cutting the
number of writes causes it to go away. It seems to imply that some
context.write operation, or something downstream from it, is taking a huge
amount of time, and that is all Hadoop internal code - not mine. So my
question is: why should increasing the number and volume of writes cause a
task to time out?

On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <to...@supertom.com> wrote:

> Sounds like mapred.task.timeout?  The default is 10 minutes.
>
> http://hadoop.apache.org/common/docs/current/mapred-default.html
>
> Thanks,
>
> Tom
>
> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com>
> wrote:
> > The map tasks fail timing out after 600 sec.
> > I am processing one 9 GB file with 16,000,000 records. Each record (think
> > is it as a line)  generates hundreds of key value pairs.
> > The job is unusual in that the output of the mapper in terms of records
> or
> > bytes orders of magnitude larger than the input.
> > I have no idea what is slowing down the job except that the problem is in
> > the writes.
> >
> > If I change the job to merely bypass a fraction of the context.write
> > statements the job succeeds.
> > This is one map task that failed and one that succeeded - I cannot
> > understand how a write can take so long
> > or what else the mapper might be doing
> >
> > JOB FAILED WITH TIMEOUT
> >
> > *Parser*TotalProteins90,103NumberFragments10,933,089
> >
> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
> > *Map-Reduce Framework*Combine output records10,033,499Map input records
> > 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine input
> > records10,844,881Map output records10,933,089
> > Same code but fewer writes
> > JOB SUCCEEDED
> >
> > *Parser*TotalProteins90,103NumberFragments206,658,758
> > *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
> > FILE_BYTES_WRITTEN220,169,922
> > *Map-Reduce Framework*Combine output records4,046,128Map input
> > records90,103Spilled
> > Records4,046,128Map output bytes662,354,413Combine input
> records4,098,609Map
> > output records2,066,588
> > Any bright ideas
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Posted by Tom Melendez <to...@supertom.com>.
Sounds like mapred.task.timeout?  The default is 10 minutes.

http://hadoop.apache.org/common/docs/current/mapred-default.html
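
If raising the timeout is acceptable as a stopgap, a sketch of how that might be
set in the job driver (values are in milliseconds; 30 minutes is an arbitrary
example, and it hides the symptom rather than fixing it):

    // Assumes org.apache.hadoop.conf.Configuration and org.apache.hadoop.mapreduce.Job.
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 30L * 60L * 1000L);   // default is 600000 (10 min)
    Job job = new Job(conf, "large-output-job");

    // Or on the command line, if the driver uses ToolRunner/GenericOptionsParser:
    //   hadoop jar myjob.jar MyDriver -D mapred.task.timeout=1800000 <input> <output>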

Thanks,

Tom

On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lo...@gmail.com> wrote:
> The map tasks fail timing out after 600 sec.
> I am processing one 9 GB file with 16,000,000 records. Each record (think
> is it as a line)  generates hundreds of key value pairs.
> The job is unusual in that the output of the mapper in terms of records or
> bytes orders of magnitude larger than the input.
> I have no idea what is slowing down the job except that the problem is in
> the writes.
>
> If I change the job to merely bypass a fraction of the context.write
> statements the job succeeds.
> This is one map task that failed and one that succeeded - I cannot
> understand how a write can take so long
> or what else the mapper might be doing
>
> JOB FAILED WITH TIMEOUT
>
> *Parser*TotalProteins90,103NumberFragments10,933,089
> *FileSystemCounters*HDFS_BYTES_READ67,245,605FILE_BYTES_WRITTEN444,054,807
> *Map-Reduce Framework*Combine output records10,033,499Map input records
> 90,103Spilled Records10,032,836Map output bytes3,520,182,794Combine input
> records10,844,881Map output records10,933,089
> Same code but fewer writes
> JOB SUCCEEDED
>
> *Parser*TotalProteins90,103NumberFragments206,658,758
> *FileSystemCounters*FILE_BYTES_READ111,578,253HDFS_BYTES_READ67,245,607
> FILE_BYTES_WRITTEN220,169,922
> *Map-Reduce Framework*Combine output records4,046,128Map input
> records90,103Spilled
> Records4,046,128Map output bytes662,354,413Combine input records4,098,609Map
> output records2,066,588
> Any bright ideas
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com