Posted to user@nutch.apache.org by reddibabu <re...@gmail.com> on 2014/03/21 12:43:59 UTC

How to crawl and index parallel way from Nutch into Solr

My requirement is to crawl and index URLs with -depth 100 and -topN 100.
The Nutch crawl command first crawls all the URLs and then indexes them,
sending the data to Solr all at once. With depth and topN both set to 100,
the whole process (crawling plus indexing) takes around 4-5 hours.

I would like to know if there is a way where crawling and indexing can be
done in parallel so that some data can be seen in the Solr admin screen
while the Nutch crawl job is still in progress.



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-crawl-and-index-parallel-way-from-Nutch-into-Solr-tp4125990.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: How to crawl and index parallel way from Nutch into Solr

Posted by Talat Uyarer <ta...@uyarer.com>.
This is possible. Moreover, you can run more than one crawler at a time.
You could look into Apache Oozie for coordinating the jobs.

Re: How to crawl and index parallel way from Nutch into Solr

Posted by anupamk <an...@usc.edu>.
Instead of using the crawl script, create your own script and run the
individual bin/nutch commands yourself:

http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling

You can do the following for each iteration:

- generate segment
- fetch segment
- parse segment
- update crawldb with segment
- solrindex the crawldb and segment
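The steps above could be sketched roughly like this. This is a minimal
sketch, not a tested script: it assumes Nutch 1.x in local mode, seed URLs
already injected into crawl/crawldb, and Solr at http://localhost:8983/solr;
all paths, the topN value, and the linkdb step are assumptions you should
adapt to your own setup.

```shell
#!/bin/sh
# Sketch of one crawl-and-index pass per iteration (assumptions: Nutch 1.x
# local mode, crawl/crawldb already injected, Solr URL below -- adjust all
# of these for your installation).
# NUTCH can be overridden, e.g. NUTCH="echo bin/nutch" for a dry run.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
SEGDIR=${SEGDIR:-crawl/segments}
LINKDB=${LINKDB:-crawl/linkdb}
SOLR=${SOLR:-http://localhost:8983/solr}

# One generate/fetch/parse/updatedb/index pass; call this once per
# depth level (100 times for the original question's -depth 100).
crawl_iteration() {
  $NUTCH generate "$CRAWLDB" "$SEGDIR" -topN 100
  # segment directories are timestamped, so the newest sorts last
  seg="$SEGDIR/$(ls "$SEGDIR" | sort | tail -n 1)"
  $NUTCH fetch "$seg"
  $NUTCH parse "$seg"
  $NUTCH updatedb "$CRAWLDB" "$seg"
  $NUTCH invertlinks "$LINKDB" "$seg"
  # Index this segment immediately, so documents show up in the Solr
  # admin screen while later iterations are still fetching.
  $NUTCH solrindex "$SOLR" "$CRAWLDB" -linkdb "$LINKDB" "$seg"
}

# Example driver (uncomment to run 100 iterations):
# for i in $(seq 1 100); do crawl_iteration; done
```

Because each segment is indexed as soon as it is parsed, data appears in
Solr incrementally instead of all at once at the end of the crawl.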
