Posted to user@nutch.apache.org by howard chen <ho...@gmail.com> on 2014/08/16 05:59:35 UTC

Use nutch as a distributed monitoring solution, any idea?

Hello

We are in the process of evaluating different open-source solutions for
our distributed monitoring system.

Currently the system is developed in house; its basic features are:

- there are over 100K URLs to monitor, each at a specific interval (1 min,
5 min, or 15 min)
- these 100K URLs are mapped to 100 parsers, which check whether particular
syntax appears in the HTML
- send out an alert if a parser fails

While it is not exactly a crawler, it is very similar in nature.

We are looking for a solution that lets us focus on our business logic
(i.e. the parsers) rather than on the moving parts of the system (e.g.
how to distribute, how to queue, etc.).
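
For illustration, each of these parsers boils down to something like the
following interface (the name and shape here are made up for the example,
not our actual in-house API):

// Hypothetical shape of a single check: the HTML is fetched once per URL
// and run through many independent checks of this kind.
public interface HtmlCheck {
    String name();
    // false means the expected syntax is missing, so an alert is sent for this URL
    boolean passes(String html);
}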

Do you think nutch would be a good candidate?

Thanks.

Re: Use nutch as a distributed monitoring solution, any idea?

Posted by Julien Nioche <li...@gmail.com>.
Hi Howard (and Sebb),

You could do it with Nutch, but due to the batch nature of MapReduce it is
not a natural fit, e.g. there is no guarantee that the previous batch
operation will be finished in time for the next one. There could be ways
around this, but the whole thing would get rather convoluted and difficult
to maintain and run in production.

Instead you could use a real-time system like Storm, which would simplify
the logic around the scheduling of the fetches. See
https://github.com/DigitalPebble/storm-crawler for components you could
reuse to that effect. It sounds like what you need is fairly
straightforward (no recursive discovery of new URLs, etc.), so it should
not be too difficult to do with Storm.
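
For a rough idea of the shape this could take, here is a minimal sketch
using plain Storm 0.9.x APIs (backtype.storm.*). The spout and bolts are
illustrative stand-ins rather than the actual storm-crawler components
linked above, and the URL list, interval and check are placeholders:

// Minimal sketch: a spout emits URLs that are due, one bolt fetches the
// HTML once, another runs the checks against it. Not storm-crawler code.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class MonitorTopology {

    // Emits every URL whose check interval has elapsed (toy in-memory schedule).
    public static class DueUrlSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final List<String> urls = Arrays.asList("http://www.example.com/");
        private long lastEmit = 0;

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            long now = System.currentTimeMillis();
            if (now - lastEmit > 60000L) {            // 1-minute interval for the toy example
                for (String url : urls) collector.emit(new Values(url));
                lastEmit = now;
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("url"));
        }
    }

    // Downloads the page body once per due URL and passes it downstream.
    public static class FetchBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String url = input.getStringByField("url");
            String html = "";
            try (InputStream in = new URL(url).openStream();
                 Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                if (s.hasNext()) html = s.next();
            } catch (Exception e) {
                // leave html empty: the check bolt will treat it as a failure
            }
            collector.emit(new Values(url, html));
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("url", "html"));
        }
    }

    // Runs the checks ("parsers") against the HTML; alerting would hang off here.
    public static class CheckBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String url = input.getStringByField("url");
            String html = input.getStringByField("html");
            if (!html.contains("<title>")) {          // placeholder for the real ~100 checks
                System.err.println("ALERT: check failed for " + url);
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("due-urls", new DueUrlSpout(), 1);
        builder.setBolt("fetch", new FetchBolt(), 10).shuffleGrouping("due-urls");
        builder.setBolt("check", new CheckBolt(), 20).shuffleGrouping("fetch");
        new LocalCluster().submitTopology("monitor", new Config(), builder.createTopology());
    }
}

In practice the spout would read the 100K URLs and their intervals from
some store, and the check bolt would dispatch to the ~100 parsers and
raise alerts on failure.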

Julien

PS: am on holiday this week, probably won't have access to the web for some
time



On 18 August 2014 09:51, howard chen <ho...@gmail.com> wrote:

> Hello
>
> On Sat, Aug 16, 2014 at 11:02 PM, Sebastian Nagel
> <wa...@googlemail.com> wrote:
> > * "mapped to 100 parsers": does it mean 100 configurations
> >   (or syntactic patterns) or really 100 parser objects?
>
>
> For each website we crawl (monitor), we need to run a set of tests
> against it, so we only download the HTML once but run as many as 100
> tests against it. The current system sucks because whenever we add a
> new test or update existing test code, we need to stop and restart the
> whole cluster. We think we shouldn't waste time reinventing a
> distributed task system, so we are looking at whether any existing
> open-source solution would be a better choice.
>
> Thanks
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Use nutch as a distributed monitoring solution, any idea?

Posted by howard chen <ho...@gmail.com>.
Hello

On Sat, Aug 16, 2014 at 11:02 PM, Sebastian Nagel
<wa...@googlemail.com> wrote:
> * "mapped to 100 parsers": does it mean 100 configurations
>   (or syntactic patterns) or really 100 parser objects?


For each website we crawl (monitor), we need to run a set of tests
against it, so we only download the HTML once but run as many as 100
tests against it. The current system sucks because whenever we add a
new test or update existing test code, we need to stop and restart the
whole cluster. We think we shouldn't waste time reinventing a
distributed task system, so we are looking at whether any existing
open-source solution would be a better choice.
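
To make that concrete: if the tests were kept as data (e.g. a name plus a
regex per line) instead of compiled code, the workers could reload them
periodically and no cluster restart would be needed. A rough sketch (the
file name, format and class names are assumptions for the example):

// Sketch only: checks live in a data file ("checks.conf", one "name<TAB>regex"
// per line) so they can be added or changed without redeploying or restarting.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CheckRegistry {

    public static class Check {
        final String name;
        final Pattern pattern;
        Check(String name, Pattern pattern) { this.name = name; this.pattern = pattern; }
    }

    private volatile List<Check> checks = new ArrayList<Check>();

    // Called periodically (or on file change) by each worker.
    public void reload() throws Exception {
        List<Check> fresh = new ArrayList<Check>();
        for (String line : Files.readAllLines(Paths.get("checks.conf"), StandardCharsets.UTF_8)) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) fresh.add(new Check(parts[0], Pattern.compile(parts[1])));
        }
        checks = fresh;   // atomic swap; pages being checked keep using the old list
    }

    // The HTML is downloaded once and every check runs against it;
    // the returned names are the checks to alert on.
    public List<String> failures(String html) {
        List<String> failed = new ArrayList<String>();
        for (Check c : checks) {
            if (!c.pattern.matcher(html).find()) failed.add(c.name);
        }
        return failed;
    }
}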

Thanks

Re: Use nutch as a distributed monitoring solution, any idea?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

in general, it should be possible to adapt Nutch to this task:

1 inject 100k URLs
  * fixed fetch interval for each can be defined in seed list:
    url \t nutchFetchIntervalMDName=<interval in seconds>

2 generate fetch list(s)
  * select pages which need to be checked now
  * partition by host (and/or parser)

3 fetch and parse (fetcher.parse = true)
  * optionally, report errors immediately
  * do not store raw and parsed content,
  * only keep fetch and parse status, and fetch time

4 update: add (next) fetch time and status to WebTable / CrawlDb

Repeat 2-4, from time to time: inject new URLs
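
As a concrete (and heavily simplified) illustration of this cycle, a small
driver could shell out to the stock bin/nutch commands. The paths, the
pause and the segment-picking logic below are assumptions for the sketch,
not a recommended production setup:

// Illustrative driver for the cycle above, shelling out to the stock bin/nutch
// commands of Nutch 1.x. Assumes fetcher.parse=true in nutch-site.xml and a
// seed list in urls/ with per-URL intervals, e.g.:
//   http://www.example.com/ \t nutchFetchIntervalMDName=60
import java.io.File;
import java.util.Arrays;

public class NutchCycle {

    static void nutch(String... args) throws Exception {
        String[] cmd = new String[args.length + 1];
        cmd[0] = "bin/nutch";
        System.arraycopy(args, 0, cmd, 1, args.length);
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) throw new RuntimeException("failed: " + Arrays.toString(cmd));
    }

    public static void main(String[] args) throws Exception {
        nutch("inject", "crawl/crawldb", "urls");                  // step 1, rerun for new URLs
        while (true) {
            nutch("generate", "crawl/crawldb", "crawl/segments");  // step 2: select pages due now
            File[] segs = new File("crawl/segments").listFiles();
            Arrays.sort(segs);
            String segment = segs[segs.length - 1].getPath();      // newest segment (assumes one was created)
            nutch("fetch", segment);                               // step 3: fetch + parse
            nutch("updatedb", "crawl/crawldb", segment);           // step 4: write back status and fetch time
            Thread.sleep(30000L);                                  // pause before the next round
        }
    }
}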

Difficulties may include:
* "mapped to 100 parsers": does it mean 100 configurations
  (or syntactic patterns) or really 100 parser objects?
  For the latter it may be more efficient to hold the objects
  in a server than to create them anew in every fetch-and-parse job
* each of these steps involves one or more MapReduce jobs,
  which means a certain amount of overhead and delay
  (in seconds) for job management and for creating a JVM for each job.
  If the check intervals have to be followed precisely,
  the scheduling provided by Nutch may not be ideal.
  But if it is acceptable that, e.g., a 1-min page is checked after 80s,
  there should be no problem.


Sebastian

On 08/16/2014 05:59 AM, howard chen wrote:
> Hello
> 
> We are in the process of evaluating different open-source solutions for
> our distributed monitoring system.
> 
> Currently the system is developed in house; its basic features are:
> 
> - there are over 100K URLs to monitor, each at a specific interval (1 min,
> 5 min, or 15 min)
> - these 100K URLs are mapped to 100 parsers, which check whether particular
> syntax appears in the HTML
> - send out an alert if a parser fails
> 
> While it is not exactly a crawler, it is very similar in nature.
> 
> We are looking for a solution that lets us focus on our business logic
> (i.e. the parsers) rather than on the moving parts of the system (e.g.
> how to distribute, how to queue, etc.).
> 
> Do you think nutch would be a good candidate?
> 
> Thanks.
>