Posted to user@nutch.apache.org by Kumar Krishnasami <ku...@vembu.com> on 2010/01/23 08:27:58 UTC

Using Nutch to crawl and use it as input to Solr

Hi All,

I am trying to decide whether I can use Nutch for a project I am working on 
with the following requirements:

1. I need to build the ability to search a given set of URLs.
2. These URLs are given to me, and there is no need to crawl links from 
or to these URLs.
3. From time to time, new URLs will be added to the original set. 
I need to update the index as soon as I get a new URL to add to 
the original set.
4. There is no need to rank these URLs based on outside links etc.
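For requirement 3, Solr's XML update format makes incremental additions straightforward: when a new URL arrives, you post a single <add> message for that document and then a <commit/>. A minimal sketch (the field names `id`, `url`, and `title` are hypothetical and would have to match your Solr schema):

```python
import xml.etree.ElementTree as ET

def solr_add_xml(fields):
    """Build an <add><doc> update message for one document.

    `fields` maps Solr field names to values; special characters
    are escaped automatically by ElementTree.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Hypothetical document for a newly added URL:
msg = solr_add_xml({
    "id": "http://example.com/page",
    "url": "http://example.com/page",
    "title": "Example & Demo",
})
print(msg)
```

You would POST this body (Content-Type: text/xml) to the update handler, typically http://localhost:8983/solr/update, and then POST `<commit/>` so the new document becomes searchable immediately.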

Based on these requirements, it seems that most of the capabilities of 
Nutch (crawling, Hadoop, etc.) would be overkill for this project. 
There is no need for a linkdb, for example.

Because of this, I am thinking that I could use Solr with some other 
component to feed it the appropriate data. If I use Solr, I would 
need a mechanism to fetch those URLs and convert them to the format 
Solr expects. Could I use Nutch for this by using just the Fetcher and 
building something that converts the HTML into the appropriate XML 
format for Solr? Is there something else I could use that anyone here 
is aware of?
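Fetching a fixed list of URLs and reducing each page to plain fields does not require Nutch's Fetcher at all; the standard library is enough. A sketch using html.parser (the extraction here is deliberately naive — a production setup would more likely use Solr Cell/Tika or a real HTML parser):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect the <title> and visible body text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def extract_fields(html, url):
    """Turn raw HTML into a flat field dict ready for a Solr <doc>."""
    parser = TextExtractor()
    parser.feed(html)
    return {"id": url, "url": url,
            "title": parser.title.strip(),
            "content": " ".join(self_text := parser.text_parts) and " ".join(self_text)}

# For a live URL you would do something like:
#   html = urlopen(url).read().decode("utf-8", "replace")
# Here, a canned page keeps the sketch self-contained:
sample = ("<html><head><title>Demo</title><script>x=1;</script></head>"
          "<body><p>Hello world</p></body></html>")
fields = extract_fields(sample, "http://example.com/")
print(fields["title"], "|", fields["content"])
```

Each resulting dict can then be serialized into Solr's <add><doc> XML and posted to the update handler, which answers the "convert the HTML into the appropriate XML format" part of the question without any Nutch machinery.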

I am just starting out with Nutch and Solr, and any help would be 
greatly appreciated.

Thanks,
Kumar.

Re: Using Nutch to crawl and use it as input to Solr

Posted by Otis Gospodnetic <og...@yahoo.com>.
Use Droids to crawl.  It already has hooks to index crawled content with Solr, e.g.
http://search-lucene.com/c?id=Droids:/droids-solr/src/main/java/org/apache/droids/solr/SolrHandler.java||solr


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/


