Posted to user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2012/10/22 10:40:28 UTC

Large input files via HTTP

I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources & processes them nightly.

Is there a reasonably flexible way to do this in the Hadoop job itself? I
expect the initial downloads to take many hours, and I'd like to be able to
tune the number of connections per host (for example, I'm limited to 5
connections to one host, whereas another host has a 3-connection limit, so I
want to use as many as each allows). Also, the set of files to download will
change a little over time, so the input list should be easily configurable
(in a config file or equivalent).

Is it normal to perform batch downloads like this before running the
MapReduce job, or is it OK to include such steps in the job itself? It seems
handy to keep the whole process as one neat package in Hadoop if possible.


Re: Large input files via HTTP

Posted by Steve Loughran <st...@hortonworks.com>.
Data ingress is often done as an initial MR job.

Here it sounds like you'd need:
- a list of URLs, which a single mapper can run through and map to
  (hostname, url)

which feeds to the reducer as:

hostname -> [url1, url2, ...]
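
As a rough sketch (not from the original thread): a mapper along these lines
could read a plain-text URL list from HDFS, one URL per line, via
TextInputFormat and emit (hostname, url) pairs. The class name, the list-file
location and the comment convention are all assumptions.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlListMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Text hostname = new Text();
  private final Text url = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String raw = line.toString().trim();
    if (raw.isEmpty() || raw.startsWith("#")) {
      return;                                  // skip blank lines and comments
    }
    hostname.set(URI.create(raw).getHost());   // group downloads by host
    url.set(raw);
    context.write(hostname, url);              // -> (hostname, url)
  }
}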

The reducer for each hostname key can then do the GET operations for that
host, using whatever per-host limits you have. Remember to keep sending
heartbeats to the TaskTracker so it knows your process is still alive. Also,
see if you can grab any Content-Length and checksum headers to verify the
file at the end of a long download - you don't want to accidentally pull a
half-complete download into your workflow.
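
A hedged sketch of such a reducer: a small thread pool caps concurrent GETs
per host, context.progress() keeps the TaskTracker from timing out the
attempt, and the Content-Length header is checked against the bytes actually
written. The "ingest.connections.<host>" configuration key and the
/ingest/raw target directory are made-up examples, not anything from the
original post.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HostDownloadReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text host, Iterable<Text> urls, final Context context)
      throws IOException, InterruptedException {
    // Hypothetical per-host connection cap, e.g. ingest.connections.example.com=5
    int limit = context.getConfiguration().getInt(
        "ingest.connections." + host.toString(), 2);
    ExecutorService pool = Executors.newFixedThreadPool(limit);
    final FileSystem fs = FileSystem.get(context.getConfiguration());

    List<Future<Void>> pending = new ArrayList<Future<Void>>();
    for (Text url : urls) {
      final String target = url.toString();   // copy: Hadoop reuses the Text object
      pending.add(pool.submit(new Callable<Void>() {
        public Void call() throws Exception {
          fetch(target, fs, context);
          return null;
        }
      }));
    }
    pool.shutdown();
    for (Future<Void> f : pending) {
      try {
        f.get();                               // surface any download failure
      } catch (ExecutionException e) {
        throw new IOException("download failed", e.getCause());
      }
    }
  }

  private void fetch(String url, FileSystem fs, Context context)
      throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    String len = conn.getHeaderField("Content-Length");
    long expected = (len == null) ? -1 : Long.parseLong(len);

    Path out = new Path("/ingest/raw", new Path(url).getName()); // illustrative target dir
    InputStream in = conn.getInputStream();
    FSDataOutputStream os = fs.create(out, true);
    long written = 0;
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) != -1) {
        os.write(buf, 0, n);
        written += n;
        context.progress();                    // heartbeat so the TaskTracker keeps us alive
      }
    } finally {
      os.close();
      in.close();
    }
    if (expected >= 0 && written != expected) { // reject a half-complete download
      fs.delete(out, false);
      throw new IOException("truncated download: " + url);
    }
  }
}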

Once the files are in HDFS you can do more work on them, which is where
something like an Oozie workflow can be handy.
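
For completeness, a possible driver that wires the two sketches above into a
single ingest job; the job name and the /ingest/url-list.txt and
/ingest/job-output paths are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IngestJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. conf.setInt("ingest.connections.example.com", 5);  // hypothetical per-host limit
    Job job = new Job(conf, "http-ingest");
    job.setJarByClass(IngestJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(UrlListMapper.class);
    job.setReducerClass(HostDownloadReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/ingest/url-list.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/ingest/job-output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}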



On 22 October 2012 09:40, David Parks <da...@yahoo.com> wrote:

> I want to create a MapReduce job which reads many multi-gigabyte input files
> from various HTTP sources & processes them nightly.
>
> Is there a reasonably flexible way to do this in the Hadoop job itself? I
> expect the initial downloads to take many hours, and I'd like to be able to
> tune the number of connections per host (for example, I'm limited to 5
> connections to one host, whereas another host has a 3-connection limit, so I
> want to use as many as each allows). Also, the set of files to download will
> change a little over time, so the input list should be easily configurable
> (in a config file or equivalent).
>
> Is it normal to perform batch downloads like this before running the
> MapReduce job, or is it OK to include such steps in the job itself? It seems
> handy to keep the whole process as one neat package in Hadoop if possible.
>
>
