Posted to user@nutch.apache.org by Max S <ma...@googlemail.com> on 2009/08/08 23:53:20 UTC

[max] Combining extracted data from multiple locations before analysing and indexing.

Hi all, 

I'm working on a project to configure Nutch to crawl a few image-rich sites.

Ideally, the approach would be to crawl each site by going through the
following steps:

1. Inject the crawldb with URLs pointing to specific categories on the
sites, and set a crawl-depth limit to focus the crawl on a few sections.
2. Crawl the pages and extract text & outlinks from the HTML.
3. Fetch the outlink contents and determine the content type of the
retrieved data.
4. If it is a JPG, extract the image metadata (roughly as in the parser
sketch after this list).
5. Combine the text extracted from the HTML with the image metadata and
analyse the combined information.
6. Index the results of the analysis.
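
For reference, the JPG parser I integrated looks roughly like the
sketch below. This is only a sketch against the Nutch 1.x parse plugin
API, and it assumes the metadata-extractor library for reading EXIF;
the class name and the tag-copying loop are placeholders. The point is
that the extracted tags end up in the parse metadata, so they are
stored with the segment:

package org.example.nutch.parse.jpg;

import java.io.ByteArrayInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class JpgParser implements Parser {

  private Configuration conf;

  public ParseResult getParse(Content content) {
    Metadata parseMeta = new Metadata();
    try {
      // Read the EXIF/IPTC tags from the raw JPG bytes.
      com.drew.metadata.Metadata exif =
          com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(
              new ByteArrayInputStream(content.getContent()));
      // ... walk exif's directories/tags here and call
      // parseMeta.add(tagName, tagValue) for each one ...
    } catch (Exception e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    }
    // Images carry no text or outlinks of their own, hence the empty
    // title, text and outlink array; the metadata is what matters.
    ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS, "",
        new Outlink[0], content.getMetadata(), parseMeta);
    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl("", data));
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The plugin is registered for the image/jpeg content type in its
plugin.xml in the usual way.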

I have managed to integrate a JPG parser, but I can't see how I can
retain the text extracted from the HTML and then combine it with the
image metadata before sending the result to the analyser. Does anyone
have any ideas?
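
The only workaround I can think of so far is to keep nothing in memory
at all and do the join at indexing time instead: the indexer already
hands each document its inlinks from the linkdb, and those carry the
anchor text of the referring HTML pages. Below is a rough sketch of
what I mean, written against the Nutch 1.x IndexingFilter interface
(the "caption" field name is just a placeholder, and the exact method
set of the interface varies a bit between versions):

package org.example.nutch.index;

import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;

public class ImageContextFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {

    // Copy the EXIF fields that the JPG parser stored in the parse
    // metadata into the index document.
    Metadata parseMeta = parse.getData().getParseMeta();
    for (String name : parseMeta.names()) {
      doc.add(name, parseMeta.get(name));
    }

    // Add the anchor text of the HTML pages linking to this image.
    // The indexer reads this from the linkdb, so nothing needs to be
    // held in memory across the fetch/parse steps.
    if (inlinks != null) {
      Iterator it = inlinks.iterator();
      while (it.hasNext()) {
        doc.add("caption", ((Inlink) it.next()).getAnchor());
      }
    }
    return doc;
  }

  // Some 1.x versions of the interface also declare this hook; a no-op
  // should be fine here.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

But this only recovers the anchor text, not the full text extracted
from the HTML page, which is why I'm asking whether there is a better
way to pass the page text along.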

Regards
Max