Posted to user@nutch.apache.org by Max S <ma...@googlemail.com> on 2009/08/08 23:53:20 UTC
[max] Combining extracted data from multiple locations before analysing and indexing.
Hi all,
I'm working on a project to configure Nutch to crawl a few image-rich sites.
Ideally, the approach would be to crawl the sites by going through the
following steps:
1. Inject the crawldb with URLs pointing to specific categories on the sites,
and set a limit on the crawl depth to focus the crawl on a few sections.
2. Crawl and extract text & outlinks from the HTML pages
3. Fetch outlink contents and determine the content type of the retrieved
data
4. If it's a JPG, extract its metadata
5. Combine the text extracted from the HTML with the image metadata and
analyse the combined information
6. Index the results from the analysis.
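To make steps 3-4 concrete, here is a minimal sketch of the kind of content-type dispatch involved: route each fetched outlink to an HTML parser or a JPG metadata extractor based on its Content-Type header. The method and return values are illustrative only, not part of the Nutch plugin API.

```java
// Hypothetical content-type dispatch mirroring steps 3-4 above:
// decide which parser handles a fetched outlink. The parser names
// returned here are placeholders, not real Nutch plugin ids.
public class ContentDispatch {
    static String route(String contentType) {
        if (contentType == null) {
            return "skip";
        }
        // Content-Type may carry parameters, e.g. "text/html; charset=utf-8"
        if (contentType.startsWith("text/html")) {
            return "html-parser";
        }
        if (contentType.startsWith("image/jpeg")) {
            return "jpeg-metadata-parser";
        }
        return "skip";
    }

    public static void main(String[] args) {
        System.out.println(route("text/html; charset=utf-8")); // html-parser
        System.out.println(route("image/jpeg"));               // jpeg-metadata-parser
        System.out.println(route("application/pdf"));          // skip
    }
}
```

In a real crawl the server-reported Content-Type is not always trustworthy, so a detector that also sniffs the file bytes is safer than header matching alone.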
I have managed to integrate a JPG parser, but I can't see how I can retain
the text extracted from the HTML and then combine it with the image metadata
before sending both to the analyser. Does anyone have any ideas?
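Since each document is parsed independently, one workable approach is a post-parse join: record the HTML text surrounding each image outlink keyed by the image URL, record the JPG metadata under the same key, and merge the two before analysis. A minimal in-memory sketch follows; all class and method names are made up for illustration and are not part of the Nutch API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical post-parse join: HTML parsing contributes the text that
// surrounded an image link, JPG parsing contributes the image's own
// metadata, and both are merged per image URL before indexing.
public class ImageRecordJoiner {
    // image URL -> text from HTML pages that linked to the image
    private final Map<String, String> htmlTextByImageUrl = new HashMap<>();
    // image URL -> metadata extracted from the JPG itself
    private final Map<String, String> jpegMetaByImageUrl = new HashMap<>();

    // Several pages may link to the same image; concatenate their text.
    void addHtmlText(String imageUrl, String surroundingText) {
        htmlTextByImageUrl.merge(imageUrl, surroundingText, (a, b) -> a + " " + b);
    }

    void addJpegMetadata(String imageUrl, String metadata) {
        jpegMetaByImageUrl.put(imageUrl, metadata);
    }

    // Combine both sources for one image; either side may be missing.
    String combined(String imageUrl) {
        String text = htmlTextByImageUrl.getOrDefault(imageUrl, "");
        String meta = jpegMetaByImageUrl.getOrDefault(imageUrl, "");
        return (text + " " + meta).trim();
    }
}
```

For a crawl of any size this join would normally be done as a separate MapReduce-style pass over the parsed segments, grouping records by URL, rather than holding everything in memory as this sketch does.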
Regards
Max