Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2014/04/29 05:14:18 UTC

Crawling multiple websites.

Hi All,

I am crawling multiple big websites for which I have the homepage as the
URL in the seed file. The problem I am facing is that one of the websites
is getting crawled at a faster pace than the rest of the websites and as a
result the indexed data contains a disproportionate number of entries for
this one website.

I suspect that this is happening because the homepage of the website in
question has the most outlinks.

My question is: how can I control the behaviour of Nutch so that it crawls
every host/domain in a balanced way?

I am using Nutch 1.7.

Thanks.

Re: Crawling multiple websites.

Posted by Zabini <an...@actimage.com>.

Hi,

You may find a solution in the fourth point of
https://wiki.apache.org/nutch/OptimizingCrawls.

Best regards,
Zabini




Re: Crawling multiple websites.

Posted by "S.L" <si...@gmail.com>.

Thanks, Feng, but that is not what we want. Do you mean there is no
mechanism by which we can limit how many URLs are fetched per host at each
level and put the rest in the queue, so that all hosts are equally
represented while the index is being built?


On Wed, Apr 30, 2014 at 1:26 AM, feng lu <am...@gmail.com> wrote:

> yes, that's right.

Re: Crawling multiple websites.

Posted by feng lu <am...@gmail.com>.

yes, that's right.


On Tue, Apr 29, 2014 at 10:53 PM, S.L <si...@gmail.com> wrote:

> Thanks, will this skip URLs at each level/fetch if a particular host has
> more than the value we set?



-- 
Don't Grow Old, Grow Up... :-)

Re: Crawling multiple websites.

Posted by "S.L" <si...@gmail.com>.

Thanks, will this skip URLs at each level/fetch if a particular host has
more than the value we set?


On Tue, Apr 29, 2014 at 10:48 AM, feng lu <am...@gmail.com> wrote:

> Maybe you can set this property to limit the number of URLs allowed per
> host/domain. The default is -1 (unlimited).
>
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
>   <description>The maximum number of urls in a single
>   fetchlist.  -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.
>   </description>
> </property>

Re: Crawling multiple websites.

Posted by feng lu <am...@gmail.com>.

Maybe you can set this property to limit the number of URLs allowed per
host/domain. The default is -1 (unlimited).

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
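
For anyone finding this thread later, here is a minimal sketch of what such
an override might look like in conf/nutch-site.xml. The cap of 100 is an
arbitrary value chosen for illustration, not a recommendation. The counting
mode property is, as far as I can tell, spelled generate.count.mode in
nutch-default.xml even though the description above refers to
generator.count.mode, so please check your own copy. My understanding is
that URLs above the cap are only left out of that fetchlist and stay in the
crawldb for later generate rounds, rather than being dropped for good.

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml (illustrative values only) -->
<configuration>
  <property>
    <name>generate.max.count</name>
    <!-- Assumed example cap per host in one fetchlist; tune for your crawl -->
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <!-- "host" counts URLs per hostname; "domain" counts them per domain -->
    <value>host</value>
  </property>
</configuration>
```

With a cap like this, a site whose homepage has many outlinks should not be
able to fill a whole fetchlist on its own, which may even out the number of
pages indexed per site over several crawl rounds.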


-- 
Don't Grow Old, Grow Up... :-)