Posted to user@nutch.apache.org by Dave Schneider <da...@cyc.com> on 2007/10/23 23:33:50 UTC
Sanity Check re: Converting customized Lucene crawl/index to use Nutch
Hi,
I have an existing application built on Lucene that takes web pages,
completely transmutes them into a form no longer recognizable as natural
language, and then searches through that form. In particular, we built
our own Lucene analyzer/tokenizer that can read our funky format. To
take advantage of the Nutch crawler and Hadoop parallelization, we'd
like to convert the application over to Nutch.
I'm thinking that I could use the parser plugins that come with Nutch,
then write an indexing plugin that takes the text recovered by the
parsers, sends it to an external process to produce the format I need
for indexing, and replaces the existing text in the document with the
transformed text before it gets indexed. I believe I'd also need an
analyzer plugin that uses the code we already have to tokenize our
weird format.
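For concreteness, here is a rough sketch of the indexing side, written
against my reading of the Nutch IndexingFilter extension point and the
Lucene 2.x Field API (so the class and method names are assumptions on
my part, and transformText is just a placeholder for the call out to
our external process; I've also left out the Configurable plumbing that
Nutch plugins need):

```java
// Sketch only -- assumes the Nutch 0.9-era IndexingFilter extension
// point and Lucene 2.x Field constants; not tested against a real build.
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class FunkyFormatIndexingFilter implements IndexingFilter {

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks) {
    // Take the plain text recovered by the Nutch parser plugins...
    String parsedText = parse.getText();
    // ...hand it to the external process that produces our format...
    String transformed = transformText(parsedText);
    // ...and swap it into the content field, so the transformed text
    // is what the analyzer sees and what ends up in the index.
    doc.removeFields("content");
    doc.add(new Field("content", transformed,
                      Field.Store.NO, Field.Index.TOKENIZED));
    return doc;
  }

  // Placeholder for shelling out / talking to the external
  // transformation process; name and signature are made up here.
  private String transformText(String text) {
    return text;
  }
}
```

The analyzer plugin would then just wrap the tokenizer we already have,
so that the same analysis runs at index time under Nutch and at search
time in our existing code.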
When this is done, I believe I should have an index that I can use with
our existing Lucene-based search code, without necessarily needing to
convert the search part over to run via Nutch.
I have this nagging feeling that I'm going to be violating some
deep-seated assumptions if I do this, so I'd appreciate any advice I
could get.
Thanks,
Dave