Posted to solr-user@lucene.apache.org by Vivekanand Ittigi <vi...@biginfolabs.com> on 2014/07/29 07:18:40 UTC
crawling all links of same domain in nutch in solr
Hi,
Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding the website http://www.techcrunch.com/ into seed.txt.
The following property is added in nutch-site.xml:
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
And the following is added in regex-urlfilter.txt:
# accept anything else
+.
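
Incidentally, if the intent is to crawl only pages of the seed domain, the catch-all rule can be narrowed instead. A sketch of a regex-urlfilter.txt restricted to techcrunch.com (the pattern is an illustration; adjust it for your own seed domain):

# accept only URLs on techcrunch.com and its subdomains
+^https?://([a-z0-9-]+\.)*techcrunch\.com/
# reject everything else
-.

Rules are applied top to bottom and the first matching +/- prefix decides whether a URL is accepted or rejected.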
Note: if I add http://www.tutorialspoint.com/ to seed.txt, I am able to
crawl its other pages, but not techcrunch.com's pages, even though that site
has many other pages too.
Please help?
Thanks,
Vivek
RE: crawling all links of same domain in nutch in solr
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - use the domain URL filter plugin and list the domains, hosts or TLDs you want to restrict the crawl to.
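
As a sketch (assuming a stock Nutch 1.x layout): enable the plugin by adding urlfilter-domain to plugin.includes in nutch-site.xml, e.g.

<!-- nutch-site.xml: include the domain URL filter alongside the regex filter -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

then list the allowed domains, hosts or TLDs, one per line, in conf/domain-urlfilter.txt:

# conf/domain-urlfilter.txt: URLs outside these domains are filtered out
techcrunch.com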