Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2014/08/20 07:03:07 UTC

Nutch not crawling all the domains in the seed list.

Hi All,

I have 10 domains in the seed list. Nutch 1.7 consistently crawls only 5
of those domains and ignores the other 5. Can you please let me know what
is preventing it from crawling all the domains?

I am running this on Hadoop 2.3.0 in cluster mode and giving a depth of 10
when submitting the job. I have already set the db.ignore.external.links
property to true, as I only intend to crawl the domains in the seed list.
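
For reference, the submission looks roughly like the sketch below (run from
the Nutch 1.7 deploy runtime so it goes to the Hadoop cluster); the urls/
and crawl/ paths and the topN value are just placeholders:

    # submit the whole crawl as a chain of MapReduce jobs, 10 rounds deep
    bin/nutch crawl urls/ -dir crawl -depth 10 -topN 1000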

Some relevant properties that I have set are mentioned below. Please
advise.

<property>
        <name>fetcher.threads.per.queue</name>
        <value>5</value>
        <description>This number is the maximum number of threads that
            should be allowed to access a queue at one time. Replaces
            deprecated parameter 'fetcher.threads.per.host'.
        </description>
    </property>

    <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <description>If true, outlinks leading from a page to external hosts
            will be ignored. This is an effective way to limit the crawl to
            include only initially injected hosts, without creating complex
            URLFilters.
        </description>
    </property>

Re: Nutch not crawling all the domains in the seed list.

Posted by "S.L" <si...@gmail.com>.
Any help, guys?


On Wed, Aug 20, 2014 at 12:13 PM, S.L <si...@gmail.com> wrote:

> Thanks. The problem is that if I reduce the URLs in the seed list to any 5,
> all of them are crawled, which tells me it is not a URL filtering issue;
> it just seems Nutch is not able to crawl more than 5 domains from the
> seed list. Is there a property that I am setting by mistake that's
> causing this behavior?
>
>
> On Wed, Aug 20, 2014 at 11:38 AM, Bin Wang <bi...@gmail.com> wrote:
>
>> Hi S.L.,
>>
>> 1. Nutch follows a site's robots.txt file by default; maybe you can take
>> a look at the robots rules for the missing domains by going to
>> http://example.com/robots.txt?
>>
>> 2. Also, there are some URL filters that will be applied; maybe you can
>> paste the output after you inject the seed.txt (nutch inject), so you can
>> make sure all the URLs passed the filtering process.
>>
>> Bin
>>
>>
>> On Tue, Aug 19, 2014 at 11:03 PM, S.L <si...@gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > I have 10 domains in the seed list. Nutch 1.7 consistently crawls only 5
>> > of those domains and ignores the other 5. Can you please let me know
>> > what is preventing it from crawling all the domains?
>> >
>> > I am running this on Hadoop 2.3.0 in cluster mode and giving a depth of
>> > 10 when submitting the job. I have already set the
>> > db.ignore.external.links property to true, as I only intend to crawl
>> > the domains in the seed list.
>> >
>> > Some relevant properties that I have set are mentioned below. Please
>> > advise.
>> >
>> > <property>
>> >         <name>fetcher.threads.per.queue</name>
>> >         <value>5</value>
>> >         <description>This number is the maximum number of threads that
>> >             should be allowed to access a queue at one time. Replaces
>> >             deprecated parameter 'fetcher.threads.per.host'.
>> >         </description>
>> >     </property>
>> >
>> >     <property>
>> >         <name>db.ignore.external.links</name>
>> >         <value>true</value>
>> >         <description>If true, outlinks leading from a page to external hosts
>> >             will be ignored. This is an effective way to limit the crawl to
>> >             include only initially injected hosts, without creating complex
>> >             URLFilters.
>> >         </description>
>> >     </property>
>> >
>>
>
>

Re: Nutch not crawling all the domains in the seed list.

Posted by "S.L" <si...@gmail.com>.
Thanks. The problem is that if I reduce the URLs in the seed list to any 5,
all of them are crawled, which tells me it is not a URL filtering issue; it
just seems Nutch is not able to crawl more than 5 domains from the seed
list. Is there a property that I am setting by mistake that's causing this
behavior?
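
One way to double-check whether the missing seeds are even present in the
CrawlDb is to look one up directly; a sketch, with a placeholder crawldb
path and an example URL rather than a real one:

    # print the CrawlDatum stored for a seed that never gets fetched
    bin/nutch readdb crawl/crawldb -url http://one-of-the-missing-domains.example/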


On Wed, Aug 20, 2014 at 11:38 AM, Bin Wang <bi...@gmail.com> wrote:

> Hi S.L.,
>
> 1. Nutch follows a site's robots.txt file by default; maybe you can take
> a look at the robots rules for the missing domains by going to
> http://example.com/robots.txt?
>
> 2. Also, there are some URL filters that will be applied; maybe you can
> paste the output after you inject the seed.txt (nutch inject), so you can
> make sure all the URLs passed the filtering process.
>
> Bin
>
>
> On Tue, Aug 19, 2014 at 11:03 PM, S.L <si...@gmail.com> wrote:
>
> > Hi All,
> >
> > I have 10 domains in the seed list. Nutch 1.7 consistently crawls only 5
> > of those domains and ignores the other 5. Can you please let me know
> > what is preventing it from crawling all the domains?
> >
> > I am running this on Hadoop 2.3.0 in cluster mode and giving a depth of
> > 10 when submitting the job. I have already set the
> > db.ignore.external.links property to true, as I only intend to crawl the
> > domains in the seed list.
> >
> > Some relevant properties that I have set are mentioned below. Please
> > advise.
> >
> > <property>
> >         <name>fetcher.threads.per.queue</name>
> >         <value>5</value>
> >         <description>This number is the maximum number of threads that
> >             should be allowed to access a queue at one time. Replaces
> >             deprecated parameter 'fetcher.threads.per.host'.
> >         </description>
> >     </property>
> >
> >     <property>
> >         <name>db.ignore.external.links</name>
> >         <value>true</value>
> >         <description>If true, outlinks leading from a page to external hosts
> >             will be ignored. This is an effective way to limit the crawl to
> >             include only initially injected hosts, without creating complex
> >             URLFilters.
> >         </description>
> >     </property>
> >
>

Re: Nutch not crawling all the domains in the seed list.

Posted by Bin Wang <bi...@gmail.com>.
Hi S.L.,

1. Nutch follows a site's robots.txt file by default; maybe you can take a
look at the robots rules for the missing domains by going to
http://example.com/robots.txt?

2. Also, there are some URL filters that will be applied; maybe you can
paste the output after you inject the seed.txt (nutch inject), so you can
make sure all the URLs passed the filtering process.

Bin
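
P.S. A quick sketch of both checks, using placeholder paths (urls/ for the
seed directory, crawl/crawldb for the CrawlDb) and an example domain rather
than one from your seed list:

    # 1. look at the robots rules for a domain that is not being crawled
    curl http://example.com/robots.txt

    # 2. run the seeds through the configured URL filters before injecting
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt

    #    then inject them and check how many URLs ended up in the CrawlDb
    bin/nutch inject crawl/crawldb urls/
    bin/nutch readdb crawl/crawldb -stats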


On Tue, Aug 19, 2014 at 11:03 PM, S.L <si...@gmail.com> wrote:

> Hi All,
>
> I have 10 domains in the seed list. Nutch 1.7 consistently crawls only 5
> of those domains and ignores the other 5. Can you please let me know what
> is preventing it from crawling all the domains?
>
> I am running this on Hadoop 2.3.0 in cluster mode and giving a depth of 10
> when submitting the job. I have already set the db.ignore.external.links
> property to true, as I only intend to crawl the domains in the seed list.
>
> Some relevant properties that I have set are mentioned below. Please
> advise.
>
> <property>
>         <name>fetcher.threads.per.queue</name>
>         <value>5</value>
>         <description>This number is the maximum number of threads that
>             should be allowed to access a queue at one time. Replaces
>             deprecated parameter 'fetcher.threads.per.host'.
>         </description>
>     </property>
>
>     <property>
>         <name>db.ignore.external.links</name>
>         <value>true</value>
>         <description>If true, outlinks leading from a page to external
>             hosts will be ignored. This is an effective way to limit the
>             crawl to include only initially injected hosts, without
>             creating complex URLFilters.
>         </description>
>     </property>
>