Posted to user@manifoldcf.apache.org by Jack Krupansky <jack.krupansky@lucidimagination.com> on 2010/06/10 19:40:40 UTC

nutch vs. LCF for web crawling

It would be nice to have a brief summary comparison of the web crawling features of LCF relative to Nutch. I don't know the details of Nutch beyond a quick read of the tutorial, but I am wondering whether there are any Nutch web crawling features that may not be available in the LCF web crawl connector.

A second question is whether Nutch has any performance or volume advantage over LCF for web crawling, in a general, rough sense, although it will eventually be good to have some specific performance tests for LCF.

I would envision people using LCF to crawl selected web sites rather than the whole web, but the number of selected sites could still be moderately large. At some point we should publish some rough guidelines as to what scale of web crawling LCF is designed to support.

(Answers could go in the LCF FAQ.)

Thanks.


-- Jack Krupansky

RE: nutch vs. LCF for web crawling

Posted by ka...@nokia.com.
Oh - another big area of difference is that Nutch is not an incremental crawler.
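
To make "incremental" concrete, here is a minimal sketch of the idea in Java: the crawler remembers a version token (say, an HTTP ETag or Last-Modified value) per URL and re-indexes a document only when the token changes. This illustrates the concept only - it is not the actual LCF connector API, and every name in it is hypothetical.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Consumer;
    import java.util.function.Function;

    // Illustrative only: an incremental crawler keeps a version token per URL
    // and re-fetches/re-indexes a document only when that token changes.
    class IncrementalFetcher {
        private final Map<String, String> knownVersions = new HashMap<>();

        // fetchVersion stands in for a cheap HEAD request returning an ETag;
        // reindex stands in for the expensive fetch-parse-index pipeline.
        void crawl(String url, Function<String, String> fetchVersion,
                   Consumer<String> reindex) {
            String current = fetchVersion.apply(url);
            String previous = knownVersions.get(url);
            if (previous != null && previous.equals(current)) {
                return; // unchanged since the last crawl: skip it
            }
            reindex.accept(url);             // document is new or changed
            knownVersions.put(url, current); // remember for the next crawl
        }
    }

A non-incremental crawler, by contrast, re-fetches and re-processes everything on each run, which matters a great deal once a crawl is repeated on a schedule.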
Karl


RE: nutch vs. LCF for web crawling

Posted by ka...@nokia.com.
Hi Jack,

Nutch research sounds like a perfect project for you to tackle.

AFAIK, there are no missing LCF *features*, but of course there will be differences in, for example, how well each crawler recognizes and extracts links from content. For instance, LCF does not extract links from anything other than HTML, XML, or text documents. I do not know Nutch's behavior here.
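
As a rough illustration of what link extraction involves, here is a toy Java sketch that pulls anchor targets out of HTML with a regular expression. This is not LCF's actual extraction code - a production crawler uses a real HTML parser and must also resolve relative links against a base URL - but it shows the kind of work being compared:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy example: collect href targets from HTML anchor tags via regex.
    class LinkExtractor {
        private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

        static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1)); // the captured href value
            }
            return links;
        }
    }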

In my reading of Nutch, the big differences have to do with architecture - Nutch is potentially distributed, running on Hadoop, and does not use an ACID database for its queue - and, as far as the target audience is concerned, Nutch is more of a toolkit than an interactive, user-friendly crawler. But that evaluation is based mainly on a relatively light and quick analysis of today's Nutch.
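
To illustrate the queue difference: an ACID database queue lets many worker threads atomically claim URLs inside transactions, so no document is ever processed twice, at the cost of database round-trips. The sketch below is hypothetical - the table, columns, and class are invented for illustration, and the SQL assumes PostgreSQL-style "LIMIT ... FOR UPDATE" - but it captures the transactional claim pattern:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Hypothetical sketch: a worker atomically claims one queued URL inside
    // a transaction, so two workers can never pick up the same document.
    class QueueClaimer {
        static String claimNextUrl(Connection conn) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement select = conn.prepareStatement(
                     "SELECT id, url FROM crawl_queue WHERE status = 'PENDING' " +
                     "ORDER BY id LIMIT 1 FOR UPDATE");
                 ResultSet rs = select.executeQuery()) {
                if (!rs.next()) {
                    conn.commit();
                    return null; // queue is empty
                }
                long id = rs.getLong("id");
                String url = rs.getString("url");
                try (PreparedStatement update = conn.prepareStatement(
                         "UPDATE crawl_queue SET status = 'ACTIVE' WHERE id = ?")) {
                    update.setLong(1, id);
                    update.executeUpdate();
                }
                conn.commit(); // the claim is durable and exclusive
                return url;
            }
        }
    }

Roughly speaking, a Hadoop-based batch queue like Nutch's trades this per-document transactional bookkeeping for scale-out batch processing, which makes fine-grained, interactive crawl management harder.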

FWIW, as I said before, MetaCarta runs a number of performance tests in-house, many of which include the RSS and Web connectors. The emphasis of that testing is to be sure LCF crawls as fast as the specified throttling parameters allow. You should not make the mistake of comparing raw throughput with throughput in a realistic throttling scenario. Any attempt to crawl a given external site at the maximum rate the code can achieve will almost certainly get you cut off by that site's sysadmin in short order, so throttling is utterly essential in the real world, and the "realistic" maximum throughput is directly related to the number of individual domains you are trying to crawl. One of MetaCarta's internal tests uses some 10,000 domains - far more than most users will ever crawl - and the crawler still performs within 20% of maximum theoretical throughput.
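
The arithmetic behind "maximum theoretical throughput" is worth spelling out: if each of D domains is throttled to at most one fetch every T seconds, the ceiling is D / T documents per second, no matter how fast the crawler code itself runs. The numbers below are illustrative only - the 30-second throttle is an assumption, not MetaCarta's actual setting:

    // Illustrative arithmetic: per-domain throttling caps overall throughput.
    public class ThrottleCeiling {
        public static void main(String[] args) {
            int domains = 10_000;          // domains in the crawl, as in the test above
            double secondsPerFetch = 30.0; // hypothetical per-domain throttle interval
            double ceiling = domains / secondsPerFetch; // D / T docs per second
            // "within 20% of maximum theoretical throughput" means >= 80% of it
            System.out.printf("Theoretical ceiling: %.0f docs/sec%n", ceiling);
            System.out.printf("Within 20%% of that: %.0f docs/sec or better%n",
                              ceiling * 0.8);
        }
    }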

My larger point is that before you ask for metrics, you really need to think through the test cases you are interested in.  A single raw number is not going to help you here.

Karl

