Posted to common-user@hadoop.apache.org by John Clarke <cl...@gmail.com> on 2009/05/27 10:58:55 UTC

avoid custom crawler getting blocked

My current project is to gather stats from a lot of different documents.
We're not indexing, just getting quite specific stats for each document.
We gather 12 different stats from each document.

Our requirements have changed somewhat: originally it worked on documents
from our own servers, but now it needs to fetch them from quite a large
variety of sources.

My approach up to now was to have the map function simply take each filepath
(or now URL) in turn, fetch the document, calculate the stats and output
those stats.
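
Roughly, the mapper looks like the sketch below (class and method names are
just for illustration, and computeStats() is a stand-in for our real stats
code):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlStatsMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // One URL per input line: fetch it, compute the stats, emit them keyed by URL.
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String url = value.toString().trim();
    InputStream in = new URL(url).openStream();
    try {
      output.collect(new Text(url), new Text(computeStats(in)));
    } finally {
      in.close();
    }
  }

  // Placeholder: just counts bytes; the real job computes our 12 stats.
  private String computeStats(InputStream in) throws IOException {
    long bytes = 0;
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      bytes += n;
    }
    return "bytes=" + bytes;
  }
}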

My new problem is that some of the locations we are now visiting don't like
their IP being hit multiple times in a row.

Is it possible to check a URL against a list of recently visited IPs and, if
one was visited recently, either wait for a certain amount of time or push it
back onto the input queue so it will be processed later?

Or is there a better way?

Thanks,
John

Re: avoid custom crawler getting blocked

Posted by jason hadoop <ja...@gmail.com>.
Random ordering helps; per-thread delays based on domain recency also help.
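
As a rough sketch of the delay idea (not from any particular library; the
host key and the 30 second window are just illustrative):

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class DomainRecencyDelay {

  // Minimum gap between two hits on the same domain (illustrative value).
  private static final long MIN_DELAY_MS = 30 * 1000L;

  // Last fetch time per domain, kept per fetcher thread.
  private final Map<String, Long> lastFetch = new HashMap<String, Long>();

  // Call once per URL, from the fetching thread, before opening the connection.
  public void waitIfNeeded(String urlString)
      throws IOException, InterruptedException {
    String domain = new URL(urlString).getHost();
    Long last = lastFetch.get(domain);
    if (last != null) {
      long elapsed = System.currentTimeMillis() - last;
      if (elapsed < MIN_DELAY_MS) {
        Thread.sleep(MIN_DELAY_MS - elapsed);  // back off before re-hitting the domain
      }
    }
    lastFetch.put(domain, System.currentTimeMillis());
  }
}

Randomizing the order of the input URLs (e.g. Collections.shuffle on the
list before writing the job input) spreads hits to the same domain further
apart, so the delays rarely trigger.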

On Wed, May 27, 2009 at 6:47 AM, Ken Krugler <kk...@transpac.com> wrote:

>> My current project is to gather stats from a lot of different documents.
>> We're not indexing, just getting quite specific stats for each document.
>> We gather 12 different stats from each document.
>>
>> Our requirements have changed somewhat: originally it worked on documents
>> from our own servers, but now it needs to fetch them from quite a large
>> variety of sources.
>>
>> My approach up to now was to have the map function simply take each
>> filepath (or now URL) in turn, fetch the document, calculate the stats
>> and output those stats.
>>
>> My new problem is that some of the locations we are now visiting don't
>> like their IP being hit multiple times in a row.
>>
>> Is it possible to check a URL against a list of recently visited IPs
>> and, if one was visited recently, either wait for a certain amount of
>> time or push it back onto the input queue so it will be processed later?
>>
>> Or is there a better way?
>>
>
> Your use case is very similar to what we've been doing with Bixo. See
> http://bixo.101tec.com, and also
> http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf
>
> Short answer is that we group URLs by paid-level domain in a map (actually
> using a Cascading GroupBy operation), and use per-domain queues with
> multi-threaded fetchers to efficiently load pages in a reduce (a Cascading
> Buffer operation).
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: avoid custom crawler getting blocked

Posted by Ken Krugler <kk...@transpac.com>.
>My current project is to gather stats from a lot of different documents.
>We're not indexing, just getting quite specific stats for each document.
>We gather 12 different stats from each document.
>
>Our requirements have changed somewhat: originally it worked on documents
>from our own servers, but now it needs to fetch them from quite a large
>variety of sources.
>
>My approach up to now was to have the map function simply take each filepath
>(or now URL) in turn, fetch the document, calculate the stats and output
>those stats.
>
>My new problem is that some of the locations we are now visiting don't like
>their IP being hit multiple times in a row.
>
>Is it possible to check a URL against a list of recently visited IPs and, if
>one was visited recently, either wait for a certain amount of time or push
>it back onto the input queue so it will be processed later?
>
>Or is there a better way?

Your use case is very similar to what we've been doing with Bixo. See 
http://bixo.101tec.com, and also 
http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf

Short answer is that we group URLs by paid-level domain in a map 
(actually using a Cascading GroupBy operation), and use per-domain 
queues with multi-threaded fetchers to efficiently load pages in a 
reduce (a Cascading Buffer operation).
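
If you wanted the same shape in plain MapReduce without Cascading, a rough
sketch would be something like the code below. This is only an illustration,
not the Bixo code: it keys by host rather than true paid-level domain, uses a
fixed sleep instead of per-domain queues, and leaves out the multi-threaded
fetchers.

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GroupFetchByDomain {

  // Map: key each URL by its domain so all URLs for one domain land in the
  // same reduce call.
  public static class DomainMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String url = value.toString().trim();
      output.collect(new Text(new URL(url).getHost()), new Text(url));
    }
  }

  // Reduce: sees every URL for one domain, so it can pace the fetches.
  public static class PacedFetchReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text domain, Iterator<Text> urls,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (urls.hasNext()) {
        String url = urls.next().toString();
        // Fetch the page and compute stats here, then emit the result.
        output.collect(new Text(url), new Text("stats-placeholder"));
        reporter.progress();  // keep the task alive during slow fetches
        try {
          Thread.sleep(10 * 1000L);  // illustrative per-domain crawl delay
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }
}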

-- Ken
-- 
Ken Krugler
+1 530-210-6378

Re: avoid custom crawler getting blocked

Posted by Tom White <to...@cloudera.com>.
Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has
solved this kind of problem.

Cheers,
Tom

On Wed, May 27, 2009 at 9:58 AM, John Clarke <cl...@gmail.com> wrote:
> My current project is to gather stats from a lot of different documents.
> We're not indexing, just getting quite specific stats for each document.
> We gather 12 different stats from each document.
>
> Our requirements have changed somewhat: originally it worked on documents
> from our own servers, but now it needs to fetch them from quite a large
> variety of sources.
>
> My approach up to now was to have the map function simply take each filepath
> (or now URL) in turn, fetch the document, calculate the stats and output
> those stats.
>
> My new problem is that some of the locations we are now visiting don't like
> their IP being hit multiple times in a row.
>
> Is it possible to check a URL against a list of recently visited IPs and, if
> one was visited recently, either wait for a certain amount of time or push it
> back onto the input queue so it will be processed later?
>
> Or is there a better way?
>
> Thanks,
> John
>