Posted to common-user@hadoop.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/05/25 17:31:16 UTC

Help with MapReduce

I am trying to read a MapFile inside mapper and reducer 
implementations.  So far the only way I have found to do it is by 
opening a new reader for each map and reduce call.  Is anybody doing 
something similar and if so is there a way to open a single reader and 
reuse it across multiple map or reduce calls?

Dennis

Re: Help with MapReduce

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I wasn't really asking about a better approach.  I was more interested in 
how you are thinking about Hadoop in terms of problem solving.  I see now 
that it is more about sustained throughput (I should have picked that up 
from the GFS paper) and that algorithms need to be coded for sustained 
throughput.  This is a different type of thinking than coding an algorithm 
for a single machine, so I am learning as I go.  Thanks for your help.

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> Ok.  This is a little different in that I need to start thinking 
>> about my algorithms in terms of sequential passes and multiple jobs 
>> instead of direct access.  That way I can use the input directories 
>> to get the data that I need.  Couldn't I also do it through the 
>> MapRunnable interface that creates a reader shared by an inner mapper 
>> class, or is that hacking the interfaces when I should be thinking 
>> about this in terms of sequential processing?
>
> You can do it however you like!  I don't know enough about your 
> problem to say definitively which is the best approach.  We're working 
> hard on Hadoop so that we can scalably stream data through MapReduce 
> at megabytes/second per node.  So you might do some back-of-the-envelope 
> calculations.  Figure at least 10ms per random access.  So 
> your maximum random access rate might be around 100/second per drive.  
> Figure a 10MB/second transfer rate, so if randomly accessed data is 
> 100kB each, then your maximum random access rate drops to 50 
> items/drive/second. Since these are over the network, real performance 
> will probably be much worse.  Also, MapFile requires a scan per entry, 
> so you might really end up scanning 1MB per access, which would slow 
> random accesses to 10 items/drive/second.  You might benchmark your 
> random access performance to get a better estimate, then compare 
> that to processing the whole collection through MapReduce.
>
> Doug

Re: Help with MapReduce

Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> Ok.  This is a little different in that I need to start thinking about 
> my algorithms in terms of sequential passes and multiple jobs instead of 
> direct access.  That way I can use the input directories to get the data 
> that I need.  Couldn't I also do it through the MapRunnable interface 
> that creates a reader shared by an inner mapper class, or is that hacking 
> the interfaces when I should be thinking about this in terms of sequential 
> processing?

You can do it however you like!  I don't know enough about your problem 
to say definitively which is the best approach.  We're working hard on 
Hadoop so that we can scalably stream data through MapReduce at 
megabytes/second per node.  So you might do some back-of-the-envelope 
calculations.  Figure at least 10ms per random access.  So your maximum 
random access rate might be around 100/second per drive.  Figure a 
10MB/second transfer rate, so if randomly accessed data is 100kB each, 
then your maximum random access rate drops to 50 items/drive/second. 
Since these are over the network, real performance will probably be much 
worse.  Also, MapFile requires a scan per entry, so you might really end 
up scanning 1MB per access, which would slow random accesses to 10 
items/drive/second.  You might benchmark your random access performance 
to get a better estimate, then compare that to processing the whole 
collection through MapReduce.

Doug
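
Spelled out, the back-of-the-envelope figures above (assuming 10ms per 
seek and a 10MB/second transfer rate per drive) work out to roughly:

  seek only:            1s / 10ms                     = ~100 accesses/drive/second
  100kB per access:     1s / (10ms + 100kB / 10MB/s)  = 1s / 20ms   = ~50 items/drive/second
  1MB scanned per access: 1s / (10ms + 1MB / 10MB/s)  = 1s / 110ms  = ~9 items/drive/second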

Re: Help with MapReduce

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Ok.  This is a little different in that I need to start thinking about 
my algorithms in terms of sequential passes and multiple jobs instead of 
direct access.  That way I can use the input directories to get the data 
that I need.  Couldn't I also do it through the MapRunnable interface 
that creates a reader shared by an inner mapper class, or is that hacking 
the interfaces when I should be thinking about this in terms of sequential 
processing?

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> The problem is that I have a single url.  I get the inlinks to that 
>> url and then I need to go access content from all of its inlink urls 
>> that have been fetched.
>> I was doing this through random access.  But then I went back and 
>> re-read the Google MapReduce paper and saw that it was designed for 
>> sequential access, and that Hadoop is implemented the same way.  But 
>> so far I haven't found a way to efficiently solve this kind of 
>> problem in a sequential format.
>
> If your input urls are only a small fraction of the collection, then 
> random access might be appropriate, or you might instead use two (or 
> more) MapReduce passes, something like:
>
> 1. url -> inlink urls (using previously inverted link db)
> 2. inlink urls -> inlink content
>
> In each case the mapping might look like it's doing random access, 
> but, if input keys are sorted, and the "tables" you're "selecting" from 
> (the link db in the first case and the content in the second) are 
> sorted, then the accesses will actually be sequential, scanning each 
> table only once.  But these will generally be remote DFS accesses.  
> MapReduce can usually arrange to place tasks on a node where the input 
> data is local, but when the map task then accesses other files this 
> optimization cannot be made.
>
> In Nutch, things are slightly more complicated, since the content is 
> organized by segment, each sorted by URL.  So you could either add 
> another MapReduce pass so that the inlink urls are sorted by segment 
> then url, or you could append all of your segments into a single segment.
>
> But if you're performing the calculation over the entire collection, 
> or even a substantial fraction, then you might be able to use a single 
> MapReduce pass, with the content and link db as inputs, performing 
> your required computations in reduce.  For anything larger than a 
> small fraction of your collection this will likely be fastest.
>
>> If I were to do it in configure() and close(), wouldn't that still 
>> open a single reader per map call?
>
> configure() and close() are only called once per map task.
>
> Doug

Re: Help with MapReduce

Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> The problem is that I have a single url.  I get the inlinks to that url 
> and then I need to go access content from all of its inlink urls that 
> have been fetched.
> I was doing this through random access.  But then I went back and 
> re-read the Google MapReduce paper and saw that it was designed for 
> sequential access, and that Hadoop is implemented the same way.  But so 
> far I haven't found a way to efficiently solve this kind of problem in 
> a sequential format.

If your input urls are only a small fraction of the collection, then 
random access might be appropriate, or you might instead use two (or 
more) MapReduce passes, something like:

1. url -> inlink urls (using previously inverted link db)
2. inlink urls -> inlink content

In each case the mapping might look like it's doing random access, but, 
if input keys are sorted, and the "tables" you're "selecting" from (the 
link db in the first case and the content in the second) are sorted, 
then the accesses will actually be sequential, scanning each table only 
once.  But these will generally be remote DFS accesses.  MapReduce can 
usually arrange to place tasks on a node where the input data is local, 
but when the map task then accesses other files this optimization cannot 
be made.

In Nutch, things are slightly more complicated, since the content is 
organized by segment, each sorted by URL.  So you could either add 
another MapReduce pass so that the inlink urls are sorted by segment 
then url, or you could append all of your segments into a single segment.

But if you're performing the calculation over the entire collection, or 
even a substantial fraction, then you might be able to use a single 
MapReduce pass, with the content and link db as inputs, performing your 
required computations in reduce.  For anything larger than a small 
fraction of your collection this will likely be fastest.

> If I were to do it in configure() and close(), wouldn't that still open a 
> single reader per map call?

configure() and close() are only called once per map task.

Doug
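
A minimal driver sketch of the two chained passes described above, written 
against the older org.apache.hadoop.mapred API (roughly as it looked in 
later 0.x releases; the 2006-era method names differ slightly).  The paths 
are hypothetical, and IdentityMapper/IdentityReducer are stand-ins for the 
real map and reduce classes that would do the joins against the link db and 
the content, so this is an illustration of the job plumbing only, not of 
the Nutch setup itself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    // Pass 1: url -> inlink urls, joining the input urls against the
    // previously inverted link db.
    JobConf pass1 = new JobConf(TwoPassDriver.class);
    pass1.setJobName("url -> inlink urls");
    FileInputFormat.addInputPath(pass1, new Path("crawl/linkdb"));     // hypothetical path
    FileInputFormat.addInputPath(pass1, new Path("crawl/input-urls")); // hypothetical path
    FileOutputFormat.setOutputPath(pass1, new Path("tmp/inlink-urls"));
    pass1.setMapperClass(IdentityMapper.class);   // placeholder for the real join mapper
    pass1.setReducerClass(IdentityReducer.class); // placeholder for the real join reducer
    JobClient.runJob(pass1);

    // Pass 2: inlink urls -> inlink content, joining pass 1's output
    // against the fetched content, both sorted by url.
    JobConf pass2 = new JobConf(TwoPassDriver.class);
    pass2.setJobName("inlink urls -> inlink content");
    FileInputFormat.addInputPath(pass2, new Path("tmp/inlink-urls"));
    FileInputFormat.addInputPath(pass2, new Path("crawl/content"));    // hypothetical path
    FileOutputFormat.setOutputPath(pass2, new Path("out/inlink-content"));
    pass2.setMapperClass(IdentityMapper.class);   // placeholder
    pass2.setReducerClass(IdentityReducer.class); // placeholder
    JobClient.runJob(pass2);
  }
}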

Re: Help with MapReduce

Posted by Dennis Kubes <nu...@dragonflymc.com>.
The problem is that I have a single url.  I get the inlinks to that url 
and then I need to go access content from all of its inlink urls that 
have been fetched. 

I was doing this through random access.  But then I went back and 
re-read the Google MapReduce paper and saw that it was designed for 
sequential access, and that Hadoop is implemented the same way.  But so 
far I haven't found a way to efficiently solve this kind of problem in 
a sequential format.

If I were to do it in configure() and close(), wouldn't that still open a 
single reader per map call?

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> I am trying to read a MapFile inside mapper and reducer 
>> implementations.  So far the only way I have found to do it is by 
>> opening a new reader for each map and reduce call.  Is anybody doing 
>> something similar and if so is there a way to open a single reader 
>> and reuse it across multiple map or reduce calls?
>
> Can't you open it in the configure() implementation?  And close it in 
> the close() implementation?
>
> Are you randomly accessing a MapFile from a map() implementation? 
> That's not going to scale very well.  MapReduce is designed for 
> sequential access.
>
> Doug

Re: Help with MapReduce

Posted by Doug Cutting <cu...@apache.org>.
Dennis Kubes wrote:
> I am trying to read a MapFile inside mapper and reducer 
> implementations.  So far the only way I have found to do it is by 
> opening a new reader for each map and reduce call.  Is anybody doing 
> something similar and if so is there a way to open a single reader and 
> reuse it across multiple map or reduce calls?

Can't you open it in the configure() implementation?  And close it in 
the close() implementation?

Are you randomly accessing a MapFile from a map() implementation? 
That's not going to scale very well.  MapReduce is designed for 
sequential access.

Doug
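
A minimal sketch of the configure()/close() approach suggested above, 
against the older org.apache.hadoop.mapred API (generic signatures as in 
later 0.x releases; the 2006-era Mapper interface was untyped).  The 
MapFile path and the Text key/value types are illustrative assumptions:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private MapFile.Reader reader;          // one reader per map task
  private final Text found = new Text();

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // "/data/lookup" is a hypothetical MapFile directory; a real job
      // might read the path from the JobConf instead of hard-coding it.
      reader = new MapFile.Reader(fs, "/data/lookup", job);
    } catch (IOException e) {
      throw new RuntimeException("could not open MapFile reader", e);
    }
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // The same reader is reused for every map() call in this task.
    if (reader.get(key, found) != null) {
      output.collect(key, found);
    }
  }

  public void close() throws IOException {
    if (reader != null) {
      reader.close();
    }
  }
}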

Re: Help with MapReduce

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I think I figured it out.  In case anyone else is having a similar 
problem, one way to do it is to implement your own MapRunnable and 
open one reader per MapRunnable instance.
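
A rough sketch of that approach, again against the older 
org.apache.hadoop.mapred API (generic signatures as in later 0.x releases). 
The MapFile path and the Text types are illustrative assumptions:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class SharedReaderMapRunner
    implements MapRunnable<Text, Text, Text, Text> {

  private JobConf job;

  public void configure(JobConf job) {
    this.job = job;
  }

  public void run(RecordReader<Text, Text> input,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    // One MapFile.Reader per MapRunnable (i.e. per map task), reused for
    // every input record.  "/data/lookup" is a hypothetical path.
    FileSystem fs = FileSystem.get(job);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/lookup", job);
    try {
      Text key = input.createKey();
      Text value = input.createValue();
      Text found = new Text();
      while (input.next(key, value)) {
        if (reader.get(key, found) != null) {
          output.collect(key, found);
        }
      }
    } finally {
      reader.close();
    }
  }
}

The job would then select it with 
JobConf.setMapRunnerClass(SharedReaderMapRunner.class) instead of 
setMapperClass().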

Dennis Kubes wrote:
> I am trying to read a MapFile inside mapper and reducer 
> implementations.  So far the only way I have found to do it is by 
> opening a new reader for each map and reduce call.  Is anybody doing 
> something similar and if so is there a way to open a single reader and 
> reuse it across multiple map or reduce calls?
>
> Dennis