Posted to user@nutch.apache.org by Svyatoslav Lavryk <la...@gmail.com> on 2015/03/11 18:10:54 UTC

Nutch 1.9 and Hadoop 1.2.1 Domains Crawl Depth

Hello,

We use Nutch 1.9 with Hadoop 1.2.1 for crawling.

Since we crawl several different domains, we would like to be able to crawl
them with different depths, ideally with a separate depth for each domain.

The crawling process is started with the following command:
nutch/runtime/deploy/bin/crawl.
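For reference, its usage in 1.9 is roughly the following (the exact
arguments vary between releases, so the usage header printed by bin/crawl
itself is authoritative):

    crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>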

According to the documentation, the scoring-depth plugin allows us to crawl
a domain to a given depth. Its properties can be modified in the
nutch-default.xml configuration file (scoring.depth.max).
After changing the scoring.depth.max value, if we understand correctly, we
need to rebuild Nutch so that a new apache-nutch-1.9.job file is prepared.
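For illustration, the property in question looks like this (the value 3
here is only an example, not what we actually use):

    <property>
      <name>scoring.depth.max</name>
      <value>3</value>
    </property>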

We do this rebuild with the following command (executed in the directory
where the Nutch sources are located):
ant runtime
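Assuming a default source checkout, the full cycle looks roughly like this:

    cd apache-nutch-1.9                      # directory containing the Nutch sources
    ant runtime                              # rebuilds runtime/local and runtime/deploy
    ls runtime/deploy/apache-nutch-1.9.job   # the regenerated job file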

Our questions:
1. Is there any way to crawl different URL lists with different depths
using this plugin, without rebuilding Nutch each time the depth changes?
2. Is there an easier method to control the crawl depth on a per-domain
basis?

Thanks in advance,
Slavik.

Re: Nutch 1.9 and Hadoop 1.2.1 Domains Crawl Depth

Posted by Svyatoslav Lavryk <la...@gmail.com>.
Thank you very much Nirav, it helped.

Re: Nutch 1.9 and Hadoop 1.2.1 Domains Crawl Depth

Posted by Nirav Thaker <nt...@outsideiq.com>.
You will need to put '_maxdepth_' metadata in the seed file, like the following:

http://domain1.com/abc _maxdepth_=2 some.other.metadata=xys


http://domain2.com/xyz _maxdepth_=99 some.other.metadata=abc
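Two things worth double-checking: the scoring-depth plugin has to be
enabled in plugin.includes (normally overridden in nutch-site.xml), and if
I remember correctly the injector expects the metadata to be separated from
the URL by tabs rather than spaces. A sketch of the override, swapping
scoring-depth in for the default scoring-opic (adjust to match your own
plugin list):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
    </property>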

HTH
