Posted to user@nutch.apache.org by Dave Schneider <da...@cyc.com> on 2007/10/23 23:33:50 UTC

Sanity Check re: Converting customized Lucene crawl/index to use Nutch

Hi,

I have an existing application using Lucene that takes web
pages, completely transmutes them into a form not at all recognizable
as natural language, and then searches through that form.  In
particular, we built our own Lucene analyzer/tokenizer that can read
our funky format.  To take advantage of the Nutch crawler and
Hadoop parallelization, we'd like to convert the application.

I'm thinking that I could use the parser plugins that come with Nutch,
then write an indexing plugin that takes the text as recovered by the
parsers, sends it to an external process to produce the format I need
for indexing, replaces the existing text in the document with the
converted text, and indexes that.  I believe I'd also need an analyzer
plugin that uses the code we already have to tokenize our weird format.
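The external-process step of that plan can be sketched roughly like this. This is just an assumption about the shape of the round-trip (parsed text in on stdin, transmuted text out on stdout); `tr a-z A-Z` stands in for our real converter, and the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: pipe the text recovered by the Nutch parsers
// through an external converter process and read back the transmuted
// form, which would then replace the document text before indexing.
public class ExternalConverter {

    public static String convert(String parsedText, String... command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        Process proc = pb.start();

        // Write the parser output to the converter's stdin,
        // closing the stream so the process sees end-of-input.
        try (OutputStream stdin = proc.getOutputStream()) {
            stdin.write(parsedText.getBytes(StandardCharsets.UTF_8));
        }

        // Read the transmuted text back from the converter's stdout.
        String result = new String(proc.getInputStream().readAllBytes(),
                                   StandardCharsets.UTF_8);
        proc.waitFor();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // "tr a-z A-Z" is a stand-in for the real external converter.
        System.out.println(convert("some parsed page text", "tr", "a-z", "A-Z"));
    }
}
```

In the real plugin this would run once per document inside the indexing filter, with the result written back into the document before Lucene sees it.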

When this is done, I believe I should have an index that I can use with 
our existing Lucene-based search code, without necessarily needing to 
convert the search part over to run via Nutch.

I have this nagging feeling that I'm going to be violating some 
deep-seated assumptions if I do this, so I'd appreciate any advice I 
could get.

Thanks,

Dave