Posted to user@nutch.apache.org by AJ Ferrigno <aj...@gmail.com> on 2015/03/30 20:44:17 UTC

Crawl External Sites to Depth of 1

Hi everyone,

I'm a new user of Nutch 2.3. I have gotten it up and running,
crawling some of our websites to a particular depth, and it mostly seems to
work fine.

One thing I have been tasked with is crawling our main website to a
particular depth n, but only going to depth 1 on external websites. This is
mainly for documents (non-HTML content such as PDFs) that are linked
externally, but it could also apply to external HTML pages that themselves
link elsewhere. I guess that means I'd want to parse the external content
but not inject new URLs once the crawl leaves a whitelisted set of domains.

I don't think this exists as a configuration option, so I have been trying
to simulate it with other methods. I have played with whitelisting good
domains via the urlfilter-domain plugin and setting db.ignore.external.links
to true, hoping that db.ignore.external.links would treat the whitelisted
domains as "non-external" (which doesn't seem to work). But even if it did,
that wouldn't limit an external domain to a depth of m (or 1 in this case)
once it is encountered.
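
For reference, this is roughly the setup I tried (the domain names below are
placeholders, and the whitelist file name follows the default
urlfilter.domain.file setting):

  <!-- nutch-site.xml excerpt -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

  # conf/domain-urlfilter.txt -- whitelisted domains, one per line
  ourmainsite.com
  partner.example.org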

Has anyone else needed to do something like this? Are there other options I
can try?

Thanks in advance,
AJ

Re: Crawl External Sites to Depth of 1

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

Assuming that the external URLs are not known beforehand,
I don't see a simple solution - you need to add a custom
scoring filter plugin.  If the URLs are known, it's easy:
check the property db.ignore.external.links.

In Nutch 1.x there is the plugin scoring-depth which
allows you to specify a maximum depth per seed URL
by writing seed lists in the format
  <seed> <tab> _maxdepth_=N
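
For example, a seed list could look like this (the URLs and depth values
are placeholders; the separator must be a real tab character):

  http://www.example.com/              _maxdepth_=3
  http://docs.example.org/start.html   _maxdepth_=1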

The scoring-depth plugin is missing in 2.x, but you could start
with scoring-opic as an example. The idea:

1. add a marker to seeds indicating that this host
    is a seed, e.g.:
     http://seedhost.com/index.html \t seed_host=true

2. implement a scoring filter, mainly the method
    distributeScoreToOutlinks:
   (a) if source host and outlink host are equal
       pass the seed_host to the outlink's CrawlDatum
       (ScoreDatum in 2.x)
   (b) if the hosts are different do NOT pass the
       seed_host marker
   (c) if there is no seed_host marker on the source
       drop all outlinks
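
A rough standalone sketch of the per-outlink decision in step 2 follows.
This is not a complete scoring filter: in a real plugin the logic would sit
inside distributeScoreToOutlinks, reading the marker from the source page's
metadata and writing it to the kept outlinks' ScoreDatum entries. The class
and method names below are only illustrative.

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  public class SeedHostOutlinkSketch {

    /**
     * Decide, for each outlink of fromUrl, whether to keep it and whether
     * to carry the seed_host marker over to it.  Returns a map from each
     * kept outlink to "carries the marker"; dropped outlinks are absent.
     */
    public static Map<String, Boolean> filterOutlinks(String fromUrl,
        boolean sourceHasSeedMarker, List<String> outlinks)
        throws MalformedURLException {
      Map<String, Boolean> kept = new LinkedHashMap<>();
      if (!sourceHasSeedMarker) {
        // (c) the source is already one hop off a seed host:
        //     drop all outlinks, i.e. stop at depth 1 on external sites
        return kept;
      }
      String fromHost = new URL(fromUrl).getHost();
      for (String outlink : outlinks) {
        String toHost = new URL(outlink).getHost();
        boolean sameHost = fromHost.equalsIgnoreCase(toHost);
        // (a) same host: keep the outlink and pass the marker on
        // (b) different host: keep the outlink (it is fetched once) but
        //     do NOT pass the marker, so (c) drops its outlinks later
        kept.put(outlink, sameHost);
      }
      return kept;
    }

    public static void main(String[] args) throws Exception {
      System.out.println(filterOutlinks(
          "http://seedhost.com/index.html", true,
          List.of("http://seedhost.com/page2.html",
                  "http://other.org/doc.pdf")));
      // prints: {http://seedhost.com/page2.html=true,
      //          http://other.org/doc.pdf=false}
    }
  }

Keeping external outlinks without the marker is what gives you the extra
depth of 1 on external sites; rule (c) then stops the crawl there.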

Cheers,
Sebastian

