Posted to mapreduce-user@hadoop.apache.org by Phil McCarthy <ph...@gmail.com> on 2010/03/06 18:29:09 UTC

Parallelizing HTTP calls with MapReduce

Hi,

I'm new to Hadoop, and I'm trying to figure out the best way to use it
with EC2 to make large number of calls to a web API, and then process
and store the results. I'm completely new to Hadoop, so I'm wondering
what's the best high-level approach, in terms of using MapReduce to
parallelize the process. The calls will be regular HTTP requests, and
the URLs follow a known format, so can be generated easily.

This seems like it'd be a pretty common type of task, so apologies if
I've missed something obvious in the docs etc.

Cheers,
Phil McCarthy

Re: Parallelizing HTTP calls with MapReduce

Posted by Phil McCarthy <ph...@gmail.com>.
Thanks for the detailed answer, this will be useful stuff to know once
I'm optimizing/tuning.

I'm actually still at the stage of figuring out how to approach
applying the mapreduce pattern to the task, so I'll take your
suggestion of asking again on common-user.

Thanks!

On Sun, Mar 7, 2010 at 8:28 AM, Kay Kay <ka...@gmail.com> wrote:
> On 03/06/2010 09:29 AM, Phil McCarthy wrote:
>>
>> Hi,
>>
>> I'm new to Hadoop, and I'm trying to figure out the best way to use it
>> with EC2 to make large number of calls to a web API,
>
> Consider using an HTTP client library / connection that is thread-safe.
>>
>>  and then process
>> and store the results. I'm completely new to Hadoop, so I'm wondering
>> what's the best high-level approach, in terms of using MapReduce to
>> parallelize the process. The calls will be regular HTTP requests, and
>> the URLs follow a known format, so can be generated easily.
>>
>
> Profile the mappers / reducers for memory usage (primarily) and watch the
> GC graph for any extreme peaks in memory use, and the CPU as well.
> While the programming language may be Java, it is best to write as if for
> an embedded environment: conserve bytes, avoid unnecessary new(), and go
> easy on regexes, etc.
> The bandwidth of intermediate results written to the context by the mappers
> (to HDFS, during the intermediate stage) and transferred to the reducers is
> a different matter altogether, and also worth considering.
>
>> This seems like it'd be a pretty common type of task, so apologies if
>> I've missed something obvious in the docs etc.
>>
>
> Good luck! As you might have figured out from the archives, the
> common-user@hadoop.apache.org list is busier than this one and, despite
> its name, is still very relevant to HDFS / MapReduce questions.
>
>> Cheers,
>> Phil McCarthy
>>
>
>

Re: Parallelizing HTTP calls with MapReduce

Posted by Kay Kay <ka...@gmail.com>.
On 03/06/2010 09:29 AM, Phil McCarthy wrote:
> Hi,
>
> I'm new to Hadoop, and I'm trying to figure out the best way to use it
> with EC2 to make large number of calls to a web API,
Consider using an HTTP client library / connection that is thread-safe.
>   and then process
> and store the results. I'm completely new to Hadoop, so I'm wondering
> what's the best high-level approach, in terms of using MapReduce to
> parallelize the process. The calls will be regular HTTP requests, and
> the URLs follow a known format, so can be generated easily.
>    
Profile the mappers / reducers for memory usage (primarily) and watch the
GC graph for any extreme peaks in memory use, and the CPU as well.
While the programming language may be Java, it is best to write as if for
an embedded environment: conserve bytes, avoid unnecessary new(), and go
easy on regexes, etc.
The bandwidth of intermediate results written to the context by the mappers
(to HDFS, during the intermediate stage) and transferred to the reducers is
a different matter altogether, and also worth considering.

> This seems like it'd be a pretty common type of task, so apologies if
> I've missed something obvious in the docs etc.
>    
Good luck! As you might have figured out from the archives, the
common-user@hadoop.apache.org list is busier than this one and, despite
its name, is still very relevant to HDFS / MapReduce questions.

> Cheers,
> Phil McCarthy
>    


Re: Parallelizing HTTP calls with MapReduce

Posted by ph...@gmail.com.
Thanks for the replies, it sounds like there are a couple of different  
approaches for me to investigate. All of these requests will actually  
be to the same service, which should reduce DNS overhead, but rate  
limiting is going to be an issue — their developer guidelines are  
pretty clear on that.

I'm looking at around 10 million calls, although there's  
no real performance requirement — it's a one-off job, but I'd prefer  
it not to take days.

Thanks again for the suggestions,

Phil

On 8 Mar 2010, at 18:46, Aaron Kimball <aa...@cloudera.com> wrote:

> I think you should actually use the Java-based MapReduce here.
>
> As has been noted, these will be network-bound calls. And if you're  
> trying to make a lot of them, my experience is that individual calls  
> are slow. 10,000 GET requests could each take a second or two,  
> especially if they involve DNS lookups. But they can be overlapped.
>
> If you're using the old API, consider using the Multithreaded  
> maprunner for this (I think that's  
> org.apache.hadoop.mapred.lib.MultithreadedMapRunner):
>
> JobConf job = new JobConf();
> job.setMapRunnerClass(MultithreadedMapRunner.class);
>
> If you're using the new API, there's an analogous
> o.a.h.mapreduce.lib.map.MultithreadedMapper that you should extend.
>
> This will allow you to pipeline all those requests and get much  
> faster throughput. (Each map task starts a thread pool of a few  
> threads, which will be given individual map inputs in an overlapped  
> fashion. The same instance of your Mapper class will be used across  
> all threads, so make sure to protect any instance variables.)
>
> For maximum efficiency, sort all your different URLs by hostname  
> first, so that each split of the input contains all the requests to  
> the same server -- this will allow your DNS caching to be much more  
> efficient (rather than have all your mappers try to DNS lookup the  
> same set of hosts).
>
> Of course, you want to be careful with something like this. A big  
> Hadoop cluster can easily bring a web server to its knees if you're  
> using too many map tasks in parallel on the same target :) You may  
> want to actually do some rate-limiting of requests to the same  
> node... but how to do that easily is a separate discussion.
>
> - Aaron
>
>
> On Sun, Mar 7, 2010 at 9:46 AM, Erez Katz <er...@yahoo.com> wrote:
> It should be very easy if you just have, say, a list of URLs as input...
> It is not even a map-reduce task... just a map task (with no reduce; I
> don't see where you would reduce on a key in this scenario).
> Look for map-only tasks in the streaming documentation.
>
> Just pick your favorite scripting language, keep reading URLs from the
> standard input stream line by line, and output the result to standard
> output.
>
> For example, in Python:
>
> import urllib, sys
>
> for line in sys.stdin:
>     url = line.strip()
>     x = urllib.urlopen(url)   # fetch the URL
>     print x.read()            # emit the response body to standard output
>     x.close()
>
>
> That's all folks.
>
>
> No real reason to use Java/C++ here; most of the time will be spent on
> network I/O.
>
>
> Cheers,
>
>  Erez Katz
>
>
> --- On Sat, 3/6/10, Phil McCarthy <ph...@gmail.com> wrote:
>
> > From: Phil McCarthy <ph...@gmail.com>
> > Subject: Parallelizing HTTP calls with MapReduce
> > To: mapreduce-user@hadoop.apache.org
> > Date: Saturday, March 6, 2010, 9:29 AM
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best
> > way to use it
> > with EC2 to make large number of calls to a web API, and
> > then process
> > and store the results. I'm completely new to Hadoop, so I'm
> > wondering
> > what's the best high-level approach, in terms of using
> > MapReduce to
> > parallelize the process. The calls will be regular HTTP
> > requests, and
> > the URLs follow a known format, so can be generated
> > easily.
> >
> > This seems like it'd be a pretty common type of task, so
> > apologies if
> > I've missed something obvious in the docs etc.
> >
> > Cheers,
> > Phil McCarthy
> >
>
>
>
>

Re: Parallelizing HTTP calls with MapReduce

Posted by Aaron Kimball <aa...@cloudera.com>.
I think you should actually use the Java-based MapReduce here.

As has been noted, these will be network-bound calls. And if you're trying
to make a lot of them, my experience is that individual calls are slow.
10,000 GET requests could each take a second or two, especially if they
involve DNS lookups. But they can be overlapped.

If you're using the old API, consider using the Multithreaded maprunner for
this (I think that's org.apache.hadoop.mapred.lib.MultithreadedMapRunner):

JobConf job = new JobConf();
job.setMapRunnerClass(MultithreadedMapRunner.class);

If you're using the new API, there's an analogous
o.a.h.mapreduce.lib.map.MultithreadedMapper that you should extend.

This will allow you to pipeline all those requests and get much faster
throughput. (Each map task starts a thread pool of a few threads, which will
be given individual map inputs in an overlapped fashion. The same instance
of your Mapper class will be used across all threads, so make sure to
protect any instance variables.)
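
To make that concrete, here is a rough sketch of the new-API wiring. The
class names (UrlFetchJob, FetchMapper), the thread count, and the output
format are placeholders I'm assuming for illustration, not anything
prescribed by Hadoop:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlFetchJob {

  // Hypothetical mapper: one input line = one URL; emits (url, response body).
  // MultithreadedMapper shares one instance across threads, so keep it stateless.
  public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim();
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
      }
      in.close();
      context.write(new Text(url), new Text(body.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "url-fetch");
    job.setJarByClass(UrlFetchJob.class);

    // MultithreadedMapper wraps the real mapper so each map task keeps
    // several slow HTTP requests in flight at once.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, FetchMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 10);   // made-up number; tune it

    job.setNumReduceTasks(0);   // map-only: nothing to shuffle or reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

MultithreadedMapper.setNumberOfThreads() controls how many concurrent
requests each map task keeps in flight.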

For maximum efficiency, sort all your different URLs by hostname first, so
that each split of the input contains all the requests to the same server --
this will allow your DNS caching to be much more efficient (rather than have
all your mappers try to DNS lookup the same set of hosts).
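
A trivial pre-sort step along these lines would do it. This is only a
sketch (the SortUrlsByHost name is made up); it reads one URL per line
from stdin and writes the host-grouped list to stdout:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortUrlsByHost {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    List<String> urls = new ArrayList<String>();
    String line;
    while ((line = in.readLine()) != null) {
      if (line.trim().length() > 0) {
        urls.add(line.trim());
      }
    }
    // Group requests to the same server together so each input split (and
    // hence each map task) mostly talks to one host and reuses DNS lookups.
    Collections.sort(urls, new Comparator<String>() {
      public int compare(String a, String b) {
        return URI.create(a).getHost().compareTo(URI.create(b).getHost());
      }
    });
    for (String u : urls) {
      System.out.println(u);
    }
  }
}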

Of course, you want to be careful with something like this. A big Hadoop
cluster can easily bring a web server to its knees if you're using too many
map tasks in parallel on the same target :) You may want to actually do some
rate-limiting of requests to the same node... but how to do that easily is a
separate discussion.
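
One crude way to start (only a sketch; the interval is a made-up number
to be tuned against the target API's published limits) is a small shared
throttle that each mapper thread calls before issuing a request:

// Crude fixed-interval throttle shared by the threads of one map task.
// Note it only limits a single task; with many map tasks on many nodes the
// aggregate rate against the target service is still numTasks times higher.
public class Throttle {
  private final long intervalMs;
  private long nextAllowedMs = 0;

  public Throttle(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  // Block until at least intervalMs has passed since the previous request.
  public synchronized void acquire() throws InterruptedException {
    long now = System.currentTimeMillis();
    if (now < nextAllowedMs) {
      Thread.sleep(nextAllowedMs - now);
      now = nextAllowedMs;
    }
    nextAllowedMs = now + intervalMs;
  }
}

The mapper would hold a static Throttle (e.g. new Throttle(200) for
roughly five requests per second per task) and call acquire() right
before opening each connection.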

- Aaron


On Sun, Mar 7, 2010 at 9:46 AM, Erez Katz <er...@yahoo.com> wrote:

> It should be very easy if you just have, say, a list of URLs as input...
> It is not even a map-reduce task... just a map task (with no reduce; I don't
> see where you would reduce on a key in this scenario).
> Look for map-only tasks in the streaming documentation.
>
> Just pick your favorite scripting language, keep reading URLs from the
> standard input stream line by line, and output the result to standard
> output.
>
> For example, in Python:
>
> import urllib, sys
>
> for line in sys.stdin:
>     url = line.strip()
>     x = urllib.urlopen(url)   # fetch the URL
>     print x.read()            # emit the response body to standard output
>     x.close()
>
>
> That's all folks.
>
>
> No real reason to use Java/C++ here; most of the time will be spent on
> network I/O.
>
>
> Cheers,
>
>  Erez Katz
>
>
> --- On Sat, 3/6/10, Phil McCarthy <ph...@gmail.com> wrote:
>
> > From: Phil McCarthy <ph...@gmail.com>
> > Subject: Parallelizing HTTP calls with MapReduce
> > To: mapreduce-user@hadoop.apache.org
> > Date: Saturday, March 6, 2010, 9:29 AM
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best
> > way to use it
> > with EC2 to make large number of calls to a web API, and
> > then process
> > and store the results. I'm completely new to Hadoop, so I'm
> > wondering
> > what's the best high-level approach, in terms of using
> > MapReduce to
> > parallelize the process. The calls will be regular HTTP
> > requests, and
> > the URLs follow a known format, so can be generated
> > easily.
> >
> > This seems like it'd be a pretty common type of task, so
> > apologies if
> > I've missed something obvious in the docs etc.
> >
> > Cheers,
> > Phil McCarthy
> >
>
>
>
>

Re: Parallelizing HTTP calls with MapReduce

Posted by Erez Katz <er...@yahoo.com>.
It should be very easy if you just have, say, a list of URLs as input...
It is not even a map-reduce task... just a map task (with no reduce; I don't see where you would reduce on a key in this scenario).
Look for map-only tasks in the streaming documentation.

Just pick your favorite scripting language, keep reading URLs from the standard input stream line by line, and output the result to standard output.

For example, in Python:

import urllib, sys

for line in sys.stdin:
    url = line.strip()
    x = urllib.urlopen(url)   # fetch the URL
    print x.read()            # emit the response body to standard output
    x.close()


That's all folks.


No real reason to use Java/C++ here; most of the time will be spent on network I/O.


Cheers,
   
  Erez Katz


--- On Sat, 3/6/10, Phil McCarthy <ph...@gmail.com> wrote:

> From: Phil McCarthy <ph...@gmail.com>
> Subject: Parallelizing HTTP calls with MapReduce
> To: mapreduce-user@hadoop.apache.org
> Date: Saturday, March 6, 2010, 9:29 AM
> Hi,
> 
> I'm new to Hadoop, and I'm trying to figure out the best
> way to use it
> with EC2 to make large number of calls to a web API, and
> then process
> and store the results. I'm completely new to Hadoop, so I'm
> wondering
> what's the best high-level approach, in terms of using
> MapReduce to
> parallelize the process. The calls will be regular HTTP
> requests, and
> the URLs follow a known format, so can be generated
> easily.
> 
> This seems like it'd be a pretty common type of task, so
> apologies if
> I've missed something obvious in the docs etc.
> 
> Cheers,
> Phil McCarthy
>