Posted to user@nutch.apache.org by steve labar <st...@gmail.com> on 2015/03/14 21:07:30 UTC

Scheduling multiple possibly parallel nutch crawls based on different configurations?

Hi,

I have a use case where I need to define schedules for crawling certain
domains with Nutch. I'm having a hard time wrapping my head around how this
would be set up. It looks to me like Nutch is designed to run as a single
instance that can, by itself, handle a huge number of hosts.

So let's say I have three organizations whose sites I will be crawling.
Each organization will have its own set of seeds, configurations, and
start and stop times for active crawling. Conceivably, each of these three
organizations would have its own crawl jobs that get fired up based on the
organization's defined schedule. Therefore, it is possible that two or
more jobs will be running at the same time. Is this something that can be
set up?
Thank you,

Re: Scheduling multiple possibly parallel nutch crawls based on different configurations?

Posted by Julien Nioche <li...@gmail.com>.
Hi guys,

Running different Nutch crawls on the same cluster is of course doable, but
generally not very efficient. Assuming you have one 'logical' crawl per
hostname, for instance, you'd end up with N instances of the Fetcher all
running at the same time, each using only a single thread and a single
Map/Reduce slot in the cluster. I have seen cases where people needed a
large Hadoop cluster just to have enough Map/Reduce task capacity for their
crawls to run at the same time, when the amount of data involved would have
required only a single machine. That's not even mentioning the code they
wrote to maintain the state of each crawl, schedule it on MapReduce, etc.
Pretty ugly, completely inefficient and an absolute nightmare to maintain
when the number of logical crawls gets large.

Now, if all you need is to prevent URLs from a given crawl from being
fetched at a given time, you could write a custom filter used only during
the generation step and add some metadata to your seed URLs to assign a
crawlID to them. At generation time, you could retrieve the crawlID for a
given URL, check with some external source of knowledge whether fetching is
allowed at that time of day for that crawlID, and let the URL be put in the
fetchlist accordingly. The info about the allowed fetching times could be
stored in conf/ or in an external DB. The URLFilter
<https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/net/URLFilter.java>
interface can't be of much use here, as it has no scope allowing you to use
it during generation only, and it does not take the metadata into account.
So yes, you'd have to hack the code in order to achieve that.
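
As a rough illustration, the time-window check itself could be as small as
the sketch below (plain Java, nothing Nutch-specific; the class name, the
window format and the example crawlIDs are made up for the illustration).
The part that actually requires patching Nutch is calling something like
this from the Generator, after reading the crawlID back from the metadata
the Injector stored from the seed file:

import java.time.LocalTime;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper: decides whether a logical crawl is allowed to put
 * URLs in the fetchlist at the current time. None of this is stock Nutch.
 */
public class CrawlWindowChecker {

  // crawlID -> allowed window "HH:mm-HH:mm"; in practice this would be
  // loaded from conf/ or from an external DB.
  private final Map<String, String> windows = new HashMap<>();

  public void addWindow(String crawlId, String window) {
    windows.put(crawlId, window);
  }

  /** True if URLs tagged with this crawlID may be generated right now. */
  public boolean isAllowedNow(String crawlId, LocalTime now) {
    String window = windows.get(crawlId);
    if (window == null) {
      return true; // no restriction configured for this crawl
    }
    String[] parts = window.split("-");
    LocalTime start = LocalTime.parse(parts[0]);
    LocalTime end = LocalTime.parse(parts[1]);
    // Simple same-day window; windows crossing midnight would need more care.
    return !now.isBefore(start) && !now.isAfter(end);
  }

  public static void main(String[] args) {
    CrawlWindowChecker checker = new CrawlWindowChecker();
    checker.addWindow("orgA", "02:00-06:00");
    // In a patched Generator the crawlID would come from the CrawlDatum
    // metadata and the URL would be skipped when this returns false.
    System.out.println(checker.isAllowedNow("orgA", LocalTime.of(3, 0)));  // true
    System.out.println(checker.isAllowedNow("orgA", LocalTime.of(12, 0))); // false
  }
}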

What some of my clients have done in similar situations is to keep a DB
with all the data (seeds, configuration, extraction rules, URL filters,
etc.) per logical crawl, write custom implementations (Injector,
URLFilters, ParseFilters, etc.) that connect to that DB, and run everything
within a single crawl. They use storm-crawler
<https://github.com/DigitalPebble/storm-crawler>, but the same approach
could be used in Nutch.
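
On the URL filtering side, such a DB-backed implementation could look
roughly like the sketch below; the crawl_rules table and the JDBC wiring
are assumptions made up for the example. A real plugin would implement
org.apache.nutch.net.URLFilter (same contract as the filter() method here:
return the URL to keep it, null to reject it), load its rules in setConf()
and be declared in its plugin.xml like the stock filters are:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Sketch of a DB-backed URL filter for one logical crawl. The crawl_rules
 * table (crawl_id, regex, accept) is hypothetical.
 */
public class DbBackedUrlFilter {

  private static class Rule {
    final Pattern pattern;
    final boolean accept;
    Rule(String regex, boolean accept) {
      this.pattern = Pattern.compile(regex);
      this.accept = accept;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  /** Load the rules for one logical crawl once, at startup. */
  public DbBackedUrlFilter(String jdbcUrl, String crawlId) throws SQLException {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = conn.prepareStatement(
             "SELECT regex, accept FROM crawl_rules WHERE crawl_id = ? ORDER BY id")) {
      ps.setString(1, crawlId);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          rules.add(new Rule(rs.getString("regex"), rs.getBoolean("accept")));
        }
      }
    }
  }

  /** Same contract as URLFilter.filter(): the URL if accepted, null otherwise. */
  public String filter(String urlString) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(urlString).find()) {
        return rule.accept ? urlString : null;
      }
    }
    return null; // no rule matched: reject by default
  }
}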

HTH

Julien


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Scheduling multiple possibly parallel nutch crawls based on different configurations?

Posted by remi tassing <ta...@gmail.com>.
I have a similar need, with an additional requirement whereby the crawlDBs
should be merged at the end.
The best solution I could think of, so far, is having independent instances
of Nutch.
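(For the merge step itself, the CrawlDbMerger job that ships with Nutch,
i.e. bin/nutch mergedb <output_crawldb> <crawldb1> <crawldb2> ..., should
cover combining the independent crawlDBs afterwards.)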
Remi