Posted to user@nutch.apache.org by Tomasz <po...@gmail.com> on 2016/03/01 17:57:21 UTC

Re: Limit number of pages per host/domain

I've been running Nutch 1.12 for two days (btw. I noticed a significant load
drop during fetching compared to 1.11: it dropped from 20 to 1.5 with 64
fetchers running). Anyway, I tried to use the domainblacklist plugin but it
didn't work. This is what I did:

- I prepared the domain list with updatehostdb/readhostdb,
- copied domainblacklist-urlfilter.txt to the conf/ directory (both files are
sketched below, after the log snippet),
- enabled the plugin in nutch-site.xml
(<name>plugin.includes</name><value>urlfilter-domainblacklist|protocol-httpclient[....])
- ran the generate command:
bin/nutch generate c1/crawldb c1/segments -topN 50000 -noFilter
- started a fetch step...

...and Nutch is still fetching URLs from the blacklist. Did I miss
something? Can the -noFilter option interfere with the domainblacklist
plugin? I guess -noFilter refers only to regex-urlfilter, am I right? I can
only see in the log that the plugin was properly activated:

INFO  domainblacklist.DomainBlacklistURLFilter - Attribute "file" is
defined for plugin urlfilter-domainblacklist as
domainblacklist-urlfilter.txt
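
For completeness, the relevant bits look roughly like this (the extra plugins
in plugin.includes stand in for my real list, the hostnames are made up, and
I'm assuming the blacklist file takes one host/domain per line with '#'
comments, the same format as the domain filter):

conf/nutch-site.xml:

  <property>
    <name>plugin.includes</name>
    <value>urlfilter-domainblacklist|protocol-httpclient|parse-(html|tika)|index-(basic|anchor)|scoring-opic</value>
  </property>

conf/domainblacklist-urlfilter.txt:

  # hosts that already have enough pages in the crawldb
  example.com
  www.example.net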

Tomasz


2016-02-24 15:48 GMT+01:00 Tomasz <po...@gmail.com>:

> Oh, great. Will try with 1.12, thanks.
>
> 2016-02-24 15:39 GMT+01:00 Markus Jelsma <ma...@openindex.io>:
>
>> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT, so it is not
>> in the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT for that
>> feature!
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From:Tomasz <po...@gmail.com>
>> > Sent: Wednesday 24th February 2016 15:26
>> > To: user@nutch.apache.org
>> > Subject: Re: Limit number of pages per host/domain
>> >
>> > Thanks a lot Markus. Unfortunately I forgot to mention that I use Nutch
>> > 1.11, but hostdb works only with 2.x I guess.
>> >
>> > Tomasz
>> >
>> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <ma...@openindex.io>:
>> >
>> > > Hello - this is possible using the HostDB. If you run updatehostdb
>> > > frequently, you get statistics on the number of fetched, redirected,
>> > > 404 and unfetched records for any given host. Using readhostdb and a
>> > > Jexl expression, you can then emit a blacklist of hosts that you can
>> > > use during generate.
>> > >
>> > > # Update the hostdb
>> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
>> > >
>> > > # Get the list of hosts that have at least 100 records fetched or not modified
>> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
>> > >
>> > > # Or get the list of hosts that have at least 100 records in total
>> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
>> > >
>> > > The list of fields that can be used in expressions (lines 93-104):
>> > >
>> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
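>> > >
>> > > Combined expressions should also work (untested sketch, assuming
>> > > standard Jexl syntax):
>> > >
>> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100 || numRecords >= 200)'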
>> > >
>> > > You now have a list of hostnames that you can use with the
>> > > domainblacklist-urlfilter at the generate step.
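>> > >
>> > > Roughly, that last step could look like this (paths are just examples
>> > > here; the -dumpHostnames output lands as plain-text part-* files under
>> > > the output directory):
>> > >
>> > > # collect the blacklisted hostnames into the filter's file
>> > > cat output/part-* > conf/domainblacklist-urlfilter.txt
>> > >
>> > > # generate as usual, with URL filtering left enabled so the blacklist is applied
>> > > bin/nutch generate crawl/crawldb crawl/segments -topN 50000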
>> > >
>> > > Markus
>> > >
>> > >
>> > > -----Original message-----
>> > > > From:Tomasz <po...@gmail.com>
>> > > > Sent: Wednesday 24th February 2016 11:30
>> > > > To: user@nutch.apache.org
>> > > > Subject: Limit number of pages per host/domain
>> > > >
>> > > > Hello,
>> > > >
>> > > > One can set generate.max.count to limit the number of URLs per domain
>> > > > or host in the next fetch step. But is there a way to limit the number
>> > > > of fetched URLs per domain/host over the whole crawl process? Suppose
>> > > > I run the generate/fetch/update cycle 6 times and want to keep the
>> > > > number of URLs per host to at most 100 (pages) in the whole crawldb.
>> > > > How can I achieve that?
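>> > > >
>> > > > For reference, in nutch-site.xml that setting would look roughly like
>> > > > this (100 and the host mode are just example values):
>> > > >
>> > > > <property>
>> > > >   <name>generate.max.count</name>
>> > > >   <value>100</value>
>> > > > </property>
>> > > > <property>
>> > > >   <name>generate.count.mode</name>
>> > > >   <value>host</value>
>> > > > </property>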
>> > > >
>> > > > Regards,
>> > > > Tomasz
>> > > >
>> > >
>> >
>>
>
>