Posted to common-user@hadoop.apache.org by Miles Osborne <mi...@inf.ed.ac.uk> on 2008/01/22 15:26:21 UTC

Hadoop-2438

Has there been any progress / a work-around for this?

Currently I'm experimenting with Streaming and I've encountered what looks
like the same problem as described here:

https://issues.apache.org/jira/browse/HADOOP-2438

So, I get much the same errors (see below).

For this particular task, when I replace the mappers and reducers with the
identity operation (i.e. just pass the data through), all is well.  When
instead I try to do something more taxing (in this case, gathering together
all ngrams with the same prefix), I get these errors.

My guess is that this is something to do with caching / buffering, since I
presume that when the Stream mapper has real work to do, the associated Java
streamer buffers input until the Mapper signals that it can process more
data.  If the Mapper is busy, then a lot of data would get cached, causing
some internal buffer to overflow.
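
For concreteness, the mapper here is doing something of this shape (a
minimal sketch of the idea, not my actual code -- the one-token prefix and
tab-separated layout are assumptions):

    # hypothetical streaming mapper: key each ngram line by its first token,
    # so the shuffle groups all ngrams sharing that prefix at one reducer
    awk '{ print $1 "\t" $0 }'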

Miles


Date: Tue Jan 22 14:12:28 GMT 2008
java.io.IOException: Broken pipe
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:260)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)


	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

java.io.IOException: MROutput/MRErrThread
failed:java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.io.Text.write(Text.java:243)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
	at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)

	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

java.io.IOException: MROutput/MRErrThread
failed:java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.io.Text.write(Text.java:243)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
	at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)

	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

Re: Hadoop-2438

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
The max heap size for each child was the default,  200M.

Thanks for this tip: right now I'm playing around with machines which we
mothballed years ago (they have 512M a pop!).  Once I put more memory into
them I'll see if this works.

(Hadoop is a great way to breathe life into otherwise unloved boxes)

Miles

On 22/01/2008, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>
>
> Uh, I'm not sure how we missed H-2438, but what is your max heap size
> for the child?
> http://lucene.apache.org/hadoop/docs/r0.15.2/hadoop-default.html#mapred.child.java.opts
> Check if it works with -Xmx512m.
>
> I think I need to open a bug to bump the default up to 512M, sigh!
>
> thanks,
> Arun
>

Re: Hadoop-2438

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
On Jan 22, 2008, at 6:26 AM, Miles Osborne wrote:

> Has there been any progress / a work-around for this?
>
> Currently I'm experimenting with Streaming and I've encountered  
> what looks
> like the same problem as described here:
>
> https://issues.apache.org/jira/browse/HADOOP-2438
>

Uh, I'm not sure how we missed H-2438, but what is your max heap size
for the child?
http://lucene.apache.org/hadoop/docs/r0.15.2/hadoop-default.html#mapred.child.java.opts
Check if it works with -Xmx512m.
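
For a streaming job that means passing it on the command line, something
like this (just a sketch -- the streaming jar name and path vary by
release, and the mapper/reducer/paths below are placeholders):

    hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
        -jobconf mapred.child.java.opts=-Xmx512m \
        -mapper yourMapper -reducer yourReducer \
        -input in -output out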

I think I need to open a bug to bump the default up to 512M, sigh!

thanks,
Arun



Re: Hadoop-2438

Posted by Vadim Zaliva <kr...@gmail.com>.
On Jan 22, 2008, at 15:17, Miles Osborne wrote:

Thanks Miles!

Vadim


Re: Hadoop-2438

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
There are machine-learning papers dealing with Map Reduce proper, e.g.:

*Map-Reduce for Machine Learning on Multicore*. Cheng-Tao Chu, Sang Kyun
Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng and Kunle Olukotun.
In *NIPS 19*, 2007.
ps:  http://www.cs.stanford.edu/%7Eang/papers/nips06-mapreducemulticore.ps
pdf: http://www.cs.stanford.edu/%7Eang/papers/nips06-mapreducemulticore.pdf

This one is interesting, since it looks at some of the theory when you do
large-scale machine learning:

*The Tradeoffs of Large Scale Learning*
Leon Bottou, Olivier Bousquet
ps.gz:  http://books.nips.cc/papers/files/nips20/NIPS2007_0726.ps.gz
pdf:    http://books.nips.cc/papers/files/nips20/NIPS2007_0726.pdf
bibtex: http://books.nips.cc/papers/files/nips20/NIPS2007_0726.bib


Happy hunting!

Miles

Re: Hadoop-2438

Posted by Vadim Zaliva <kr...@gmail.com>.
On Jan 22, 2008, at 14:44, Ted Dunning wrote:

I am also very interested in machine learning applications of MapReduce,
Collaborative Filtering in particular. If there are lists/groups/publications
related to this subject, I would appreciate any pointers.

Sincerely,
Vadim



Re: Hadoop-2438

Posted by Ted Dunning <td...@veoh.com>.
I would love to talk more off-line about our efforts in this regard.

I will send you email.




Re: Hadoop-2438

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
In my case, I'm using actual mappers and reducers, rather than shell script
commands.  I've also used Map-Reduce at Google when I was on sabbatical
there in 2006.

That aside, I do take your point -- you need to have a good grip on what Map
Reduce does to understand some of the challenges.  Here at Edinburgh I'm
leading a little push to start doing some of our core research within this
environment.  As a starter, I'm looking at the simple task of estimating
large n-gram based language models using M-R (think 5-grams and upwards from
lots of web data).  We are also about to look at core machine learning, such
as EM, within this framework.  So, lots of fun and games ... and for me,
it is quite nice doing this kind of thing.  A good break from the usual
research.

Miles


Re: Hadoop-2438

Posted by Ted Dunning <td...@veoh.com>.

Streaming has some real conceptual confusions awaiting the unwary.

For instance, if you want to count occurrences of each distinct line, a
correct implementation is this:

    stream -mapper cat -reducer 'uniq -c'

(stream is an alias I use to avoid typing hadoop -jar ....)
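
A plausible definition of that alias, for reference (the jar path here is an
assumption -- adjust it for your release):

    alias stream='hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar'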

It is tempting, though very dangerous, to do

    stream -mapper 'sort | uniq -c' -reducer '...add up counts...'

But this doesn't work right, because the mapper isn't supposed to produce
output after the last input line.  (It also tends not to work due to quoting
issues, but we can ignore that for the moment.)  A similar confusion occurs
when the mapper exits, even normally.  Take the following program:

    stream -mapper 'head -10' -reducer '...whatever...'

Here the mapper acts like the identity mapper for the first ten input
records and then exits.  According to the implicit contract, it should
instead stick around, accept all subsequent inputs, and produce no further
output.
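
One way to honour that contract while keeping the head-like behaviour is a
mapper that stops printing but keeps consuming its input, e.g. (a sketch):

    stream -mapper "awk 'NR <= 10'" -reducer '...whatever...'

awk 'NR <= 10' prints the first ten lines and then silently reads the rest
of its input to EOF instead of exiting.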

The need for a fairly deep understanding of how hadoop works, and of how
normal shell processing idioms need to be modified, makes streaming a pretty
tricky thing to use, especially for the map-reduce novice.

I don't think that this problem can be easily corrected since it is due to a
fairly fundamental mismatch between shell programming tradition and what a
mapper or reducer is.




RE: Hadoop-2438

Posted by Joydeep Sen Sarma <js...@facebook.com>.
> My guess is that this is something to do with caching / buffering, since I
> presume that when the Stream mapper has real work to do, the associated Java
> streamer buffers input until the Mapper signals that it can process more
> data.  If the Mapper is busy, then a lot of data would get cached, causing
> some internal buffer to overflow.

Unlikely. The Java buffer would be a fixed size; it would write to a Unix pipe periodically. If the streaming mapper is not consuming data, the Java side would quickly become blocked writing to this pipe.

The broken pipe case is extremely common and just tells you that the mapper died. The best thing to do is find the stderr log for the task (from the jobtracker UI) and see if the mapper left something there before dying.
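
On the tasktracker node itself, that stderr usually lands under the Hadoop
log directory. The exact layout is release-dependent, and the task id below
is just a placeholder:

    less $HADOOP_HOME/logs/userlogs/task_200801221412_0001_m_000000_0/stderr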


If streaming gurus are reading this, I am curious about one unrelated thing: the Java map task does a flush() on the buffered output stream feeding the streaming mapper after every input line. That seemed like unnecessary overhead to me; I was curious why (there must be some rationale).


