Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/07/24 16:39:43 UTC

Can I "chunk" during the crawl?

Forgive me if this is a bit of a n00b question.
I've been tasked with taking someone else's code and replacing all the
DieselPoint code with Lucene/Nutch.  What they do in DieselPoint is crawl
specific parts of the web, then perform some proprietary splitting of the
returned pages into "chunks", and then the chunks themselves are indexed.
Actually, I think they do it in a rather naive way: it appears that
DieselPoint crawls and indexes, then this code goes through the index and
creates chunk files, possibly several from any given initial page, and then
DieselPoint is set loose to crawl and index those chunk files.  The app then
uses *that* index in proprietary searches.
I'm trying to learn my way around Nutch, and I'm wondering if there might be
a way to get rid of the chunking stage by doing it directly in the initial
crawl, possibly by writing a plugin.  Unfortunately I'm under NDA so I can't
give away too much of what the chunking process does, but I hope I've given
enough information on what I'm trying to do.  Is this possible?

-- 
http://www.linkedin.com/in/paultomblin