Posted to solr-user@lucene.apache.org by Vivekanand Ittigi <vi...@biginfolabs.com> on 2014/07/29 07:18:40 UTC

crawling all links of same domain in nutch in solr

Hi,

Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.

The following property is added in nutch-site.xml:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

And the following is added in regex-urlfilter.txt:

# accept anything else
+.
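
(For reference, a regex-urlfilter.txt that instead keeps the crawl on a
single domain would look something like the sketch below; the
techcrunch.com pattern is only an illustration:

# accept techcrunch.com and its subdomains
+^https?://([a-zA-Z0-9-]+\.)*techcrunch\.com/
# reject everything else; URLs matching no rule are rejected anyway
-.
)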

Note: if I add http://www.tutorialspoint.com/ in seed.txt, I'm able to
crawl all of its other pages, but not techcrunch.com's pages, even though
that site has many other pages too.

Please help.

Thanks,
Vivek

RE: crawling all links of same domain in nutch in solr

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - use the domain URL filter plugin and list the domains, hosts, or TLDs you want to restrict the crawl to.
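
For example, in Nutch 1.x that means adding urlfilter-domain to
plugin.includes in nutch-site.xml and listing the allowed domains in
conf/domain-urlfilter.txt. A minimal sketch (the plugin list below mirrors
the usual defaults and may need adjusting for your version):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>The usual default plugins plus the domain URL filter
  (urlfilter-domain).</description>
</property>

And in conf/domain-urlfilter.txt, one entry per line:

# allow this domain and its subdomains
techcrunch.com

URLs whose host does not fall under one of the listed domains, hosts, or
TLDs are then filtered out during the fetch and update steps.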
