Posted to user@nutch.apache.org by Waleed <wa...@students.poly.edu> on 2012/01/07 09:03:32 UTC

Crawl only *.*.us

Hello everyone,
I am trying to crawl only .us domains; for example, I want all domains under
com.us, net.us, etc.
Of course, I have them all in my seed list.

I set the internal and external link properties in nutch-default.xml:
......
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
....

But I still get some documents that are not in my seed list!
Am I missing something?

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3639778.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Crawl only *.*.us

Posted by Markus Jelsma <ma...@openindex.io>.
You can use the domain URL filter (urlfilter-domain) to crawl only URLs in the listed domains.
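For reference, a minimal sketch of what that filter file can look like, assuming the default file name conf/domain-urlfilter.txt (the plugin itself must also be enabled, as discussed further down the thread):

```text
# conf/domain-urlfilter.txt -- one host, domain, or suffix per line;
# a URL is accepted if its host matches or falls under one of these entries
us
```

With only "us" listed, hosts such as www.example.com.us would be accepted and everything else rejected.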

> I am trying to crawl only .us domains; for example, I want all domains
> under com.us, net.us, etc. Of course, I have them all in my seed list.
> [...]
> But I still get some documents that are not in my seed list!
> Am I missing something?

Re: Crawl only *.*.us

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Waleed,

> in nutch-default.xml:
>
> <property>
> <name>plugin.includes</name>
> <value>domain-urlfilter.txt</value>
> </property>

No, you have to adapt the property so that among other plugins
urlfilter-domain is accepted by the regular expression. E.g.:

<property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-(regex|domain)|parse-...</value>
</property>
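A fuller illustration of what the adapted property might look like. The exact default value of plugin.includes varies between Nutch versions, so treat this as an example only; also note that local overrides are conventionally placed in conf/nutch-site.xml rather than edited into nutch-default.xml:

```xml
<!-- Example override in conf/nutch-site.xml; your version's default
     plugin list may differ. The key change is urlfilter-(regex|domain),
     which enables the domain URL filter alongside the regex filter. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```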

 > And in domain-urlfilter.txt:
 > I add just:
 > .us
 > And then I'll be OK to go?

No, it should be just:

us
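Conceptually, a domain filter accepts a URL whose host equals, or is a sub-domain of, one of the listed entries. A rough Python sketch of the idea (not Nutch's actual implementation; function and variable names are illustrative only):

```python
from urllib.parse import urlparse

def domain_filter(url, allowed):
    """Keep a URL if its host equals, or is a sub-domain of, an allowed
    entry. Rough sketch of the idea behind urlfilter-domain."""
    host = urlparse(url).hostname or ""
    for entry in allowed:
        if host == entry or host.endswith("." + entry):
            return url   # accepted: pass the URL through
    return None          # rejected: filter the URL out

print(domain_filter("http://www.example.com.us/page", ["us"]))  # accepted
print(domain_filter("http://www.example.com/page", ["us"]))     # rejected (None)
```

This is why a bare suffix entry like "us" is enough: every *.com.us or *.net.us host ends with ".us".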

This thread might also help:
http://lucene.472066.n3.nabble.com/Getting-domain-urlfilter-to-work-td618253.html

But your first solution
 > <property>
 >    <name>db.ignore.external.links</name>
 >    <value>true</value>
 > </property>
should do the same. Only documents from the hosts in your seed list are crawled.

 > But I still get some documents not in my seed !!??
If you want to crawl only the seed list, it's easier to set -depth to 1 and
set -topN large enough that your whole seed list fits.
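For example, with the one-shot crawl command of Nutch 1.x of that era (paths and numbers are placeholders; adjust them to your setup):

```shell
# Fetch only the injected seed URLs: depth 1 means no link following,
# and -topN must be at least the number of URLs in your seed list.
bin/nutch crawl urls -dir crawl -depth 1 -topN 50000
```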

Bye, Sebastian

Re: Crawl only *.*.us

Posted by Waleed <wa...@students.poly.edu>.
Thank you.

So to configure it, in nutch-default.xml:

<property>
<name>plugin.includes</name>
<value>domain-urlfilter.txt</value>
</property> 

And in domain-urlfilter.txt:
I add just:
.us
And then I'll be OK to go?



--
View this message in context: http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3641570.html