Posted to user@nutch.apache.org by Waleed <wa...@students.poly.edu> on 2012/01/07 09:03:32 UTC
Crawl only *.*.us
Hello everyone
I am trying to crawl only .us domains; for example, I want all domains under
com.us, net.us, etc.
Of course I have them all in my seed list.
I set the internal and external link properties in nutch-default.xml:
......
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
....
But I still get some documents that are not in my seed list!?
Am I missing something?
--
View this message in context: http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3639778.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Crawl only *.*.us
Posted by Markus Jelsma <ma...@openindex.io>.
You can use the domain url filter to crawl only urls in the listed domains.
> Hello everyone
> I am trying to crawl only .us domains; for example, I want all domains under
> com.us, net.us, etc.
> Of course I have them all in my seed list.
>
> I set the internal and external link properties in nutch-default.xml:
> ......
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> <description>If true, when adding new links to a page, links from
> the same host are ignored. This is an effective way to limit the
> size of the link database, keeping only the highest quality
> links.
> </description>
> </property>
>
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> <description>If true, outlinks leading from a page to external hosts
> will be ignored. This is an effective way to limit the crawl to include
> only initially injected hosts, without creating complex URLFilters.
> </description>
> </property>
> ....
>
> But I still get some documents that are not in my seed list!?
> Am I missing something?
>
Re: Crawl only *.*.us
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Waleed,
> in nutch-default.xml:
>
> <property>
> <name>plugin.includes</name>
> <value>domain-urlfilter.txt</value>
> </property>
No, you have to adapt the property so that, among the other plugins,
urlfilter-domain is matched by the regular expression. E.g.:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|domain)|parse-...</value>
</property>
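For reference, here is a sketch of a fuller plugin.includes value; apart from urlfilter-domain, the listed plugins are illustrative and should match what your crawl actually needs:

```xml
<property>
  <name>plugin.includes</name>
  <!-- sketch only: urlfilter-domain added alongside a typical plugin set -->
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```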
> And in domain-urlfilter.txt :
> I add just :
> .us
> And then I'll be OK to go?
No, it should be just:
us
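The file takes one domain, suffix, or host per line, and lines starting with "#" are comments. A sketch (the commented-out entries are only illustrative):

```
# domain-urlfilter.txt: accept anything under the .us suffix
us
# you could also list specific domains or hosts, e.g.:
# apache.org
```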
This thread might also help:
http://lucene.472066.n3.nabble.com/Getting-domain-urlfilter-to-work-td618253.html
But your first solution
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> </property>
should do the same. Only documents from the hosts in your seed list are crawled.
> But I still get some documents not in my seed !!??
If you want to crawl only the seed list, it's easier to set -depth to 1 and
-topN large enough that your whole seed list fits.
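With the standard crawl command that would look something like this (a sketch; the seed directory, output directory, and -topN value are placeholders to adjust):

```shell
# depth 1 = fetch only the injected seeds; -topN must be >= seed list size
bin/nutch crawl urls -dir crawl -depth 1 -topN 10000
```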
Bye, Sebastian
Re: Crawl only *.*.us
Posted by Waleed <wa...@students.poly.edu>.
Thank you.
So to configure it:
in nutch-default.xml:
<property>
<name>plugin.includes</name>
<value>domain-urlfilter.txt</value>
</property>
And in domain-urlfilter.txt:
I add just:
.us
And then I'll be OK to go?
--
View this message in context: http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3641570.html
Sent from the Nutch - User mailing list archive at Nabble.com.