Posted to user@nutch.apache.org by Mathijs Homminga <ma...@knowlogy.nl> on 2007/01/05 11:42:29 UTC

Reparsing fetched content

Dear List,

I have written a parse-jpg plugin which rescales JPEG images before 
storing them:


public class JPEGParseFilter implements Parser {
...
  public Parse getParse(Content content) {
    ...
    content.setContent(scaledImage);
  }
}

This works fine when parsing is done at fetch time. So I assume that the 
Fetcher stores the content after it has been parsed (if parsing is not 
disabled).
However, when I perform a reparse (to scale down the images even 
further) the content does not seem to be modified.


Question 1: Is it true that the code above changes the fetched content 
before storing it (throwing away the original content)?

Question 2: Can I run this parse plugin to reparse the images and change 
the content again (e.g. to make the images smaller, without the need to 
refetch all content)? Or is the content write-once during fetch-parse time?

Thanks a lot,
Mathijs






Re: Reparsing fetched content

Posted by Eelco Lempsink <le...@paragin.nl>.
Hi Mathijs,

On 5-jan-2007, at 11:42, Mathijs Homminga wrote:
> I have written a parse-jpg plugin which rescales JPEG images before  
> storing them:
>
>
> public class JPEGParseFilter implements Parser {
> ...
>  public Parse getParse(Content content) {
>    ...
>    content.setContent(scaledImage);
>  }
> }
>
> This works fine when parsing is done at fetch time. So I assume that  
> the Fetcher stores the content after it has been parsed (if parsing  
> is not disabled).
> However, when I perform a reparse (to scale down the images even  
> further) the content does not seem to be modified.

The parsed content will be saved in the directories parse_data and  
parse_text in the segment dir.  The input directory used is the  
content directory, which contains the fetched data.

> Question 1: Is it true that the code above changes the fetched  
> content before storing it (throwing away the original content)?

No.  The original content is never stored.  setContent() just  
modifies the loaded object.  All the parse job does is:  
dir:content -> job:parse -> dir:parse_data and dir:parse_text.

> Question 2: Can I run this parse plugin to reparse the images and  
> change the content again (e.g. to make the images smaller, without  
> the need to refetch all content)? Or is the content write-once  
> during fetch-parse time?

Not directly.  The parse job is a simple, straightforward write-once  
operation.  To reparse already parsed data you would have to  
implement your own job, which, for example, takes the parse_data as  
input directory, reparses the data to a temporary directory and then  
replaces the original parse_data with the new one.
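
As a rough illustration of the last step only (swapping a freshly written
directory in for the existing one), here is a minimal sketch in plain Java.
It uses java.nio.file on a local file system rather than the Hadoop
FileSystem API that a real Nutch job would go through, and the names
DirectorySwap, install and deleteTree are hypothetical, not taken from
Nutch.  It assumes both directories live on the same file system, so that
moving a non-empty directory is a simple rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectorySwap {

    /** Recursively deletes a directory tree; does nothing if it is absent. */
    static void deleteTree(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> entries;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            entries = paths.sorted(Comparator.reverseOrder())
                           .collect(Collectors.toList());
        }
        for (Path p : entries) {
            Files.delete(p);
        }
    }

    /**
     * Replaces 'current' with 'replacement'.
     * The existing directory is first renamed to "<name>.old", the new
     * directory is moved into place, and the backup is deleted last.
     * Assumption: both paths are on the same file system.
     */
    static void install(Path current, Path replacement) throws IOException {
        Path old = current.resolveSibling(current.getFileName() + ".old");
        deleteTree(old);                 // clear leftovers from an earlier run
        if (Files.exists(current)) {
            Files.move(current, old);    // keep a backup while swapping
        }
        Files.move(replacement, current);
        deleteTree(old);                 // swap succeeded, drop the backup
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical layout: a segment dir with parse_data and a temp dir.
        Path segment = Files.createTempDirectory("segment");
        Path parseData = Files.createDirectory(segment.resolve("parse_data"));
        Path tmp = Files.createDirectory(segment.resolve("parse_data.tmp"));
        Files.write(parseData.resolve("part"), "old".getBytes("UTF-8"));
        Files.write(tmp.resolve("part"), "new".getBytes("UTF-8"));

        install(parseData, tmp);

        byte[] bytes = Files.readAllBytes(parseData.resolve("part"));
        System.out.println(new String(bytes, "UTF-8"));
        deleteTree(segment);
    }
}
```

Keeping the old directory around until the new one is in place means a
failure halfway through leaves either the old or the new data on disk,
never neither.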

Take a look at the org.apache.nutch.parse.ParseSegment class to see  
how the parse job works.   Also, take a look at the  
org.apache.nutch.crawl.CrawlDb and  
org.apache.nutch.crawl.CrawlDbMerger classes for ways to implement  
the replacing of an existing directory.

Good luck!

-- 
Regards,

Eelco Lempsink