Posted to user@nutch.apache.org by Mathijs Homminga <ma...@knowlogy.nl> on 2007/01/05 11:42:29 UTC
Reparsing fetched content
Dear List,
I have written a parse-jpg plugin which rescales JPEG images before
storing them:
public class JPEGParseFilter implements Parser {
    ...
    public Parse getParse(Content content) {
        ...
        content.setContent(scaledImage);
    }
}
This works fine when parsing is done at fetch time. So I assume that the
Fetcher stores the content after it has been parsed (if parsing is not
disabled).
However, when I perform a reparse (to scale down the images even
further) the content does not seem to be modified.
Question 1: Is it true that the code above changes the fetched content
before storing it (throwing away the original content)?
Question 2: Can I run this parse plugin to reparse the images and change
the content again (e.g. to make the images smaller, without the need to
refetch all content)? Or is the content write-once during fetch-parse time?
Thanks a lot,
Mathijs
Re: Reparsing fetched content
Posted by Eelco Lempsink <le...@paragin.nl>.
Hi Mathijs,
On 5-jan-2007, at 11:42, Mathijs Homminga wrote:
> I have written a parse-jpg plugin which rescales JPEG images before
> storing them:
>
> public class JPEGParseFilter implements Parser {
>     ...
>     public Parse getParse(Content content) {
>         ...
>         content.setContent(scaledImage);
>     }
> }
>
> This works fine when parsing is done at fetch time. So I assume that
> the Fetcher stores the content after it has been parsed (if parsing
> is not disabled).
> However, when I perform a reparse (to scale down the images even
> further) the content does not seem to be modified.
The parsed content will be saved in the directories parse_data and
parse_text in the segment dir. The input directory used is the
content directory, which contains the fetched data.
> Question 1: Is it true that the code above changes the fetched
> content before storing it (throwing away the original content)?
No. The original content is never stored. setContent() just
modifies the loaded object. All the parse job does is:
dir:content -> job:parse -> dir:parse_data and dir:parse_text.
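As a rough illustration of the point that setContent() only changes the object in memory, here is a minimal Java sketch. LoadedContent is a hypothetical stand-in, not Nutch's Content class, and a temporary file stands in for the segment's content directory. Both are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class InMemoryContentDemo {

    // Hypothetical stand-in for a loaded content record (NOT Nutch's Content class).
    static final class LoadedContent {
        private byte[] content;

        LoadedContent(byte[] content) { this.content = content; }

        void setContent(byte[] content) { this.content = content; }

        byte[] getContent() { return content; }
    }

    // Returns true if the bytes on disk are unchanged after the loaded object is modified.
    static boolean storedBytesSurviveSetContent() {
        try {
            // Assumption: a temp file plays the role of the stored, fetched data.
            Path stored = Files.createTempFile("content", ".bin");
            try {
                byte[] original = {1, 2, 3, 4};
                Files.write(stored, original);

                // Load the stored bytes into an object, then modify only the object.
                LoadedContent loaded = new LoadedContent(Files.readAllBytes(stored));
                loaded.setContent(new byte[] {9});

                // Nothing wrote the new bytes back, so the file still holds the original.
                return Arrays.equals(Files.readAllBytes(stored), original);
            } finally {
                Files.deleteIfExists(stored);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(storedBytesSurviveSetContent());
    }
}
```

The stored bytes change only if some job explicitly writes the modified object back out, which this sketch deliberately does not do.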
> Question 2: Can I run this parse plugin to reparse the images and
> change the content again (e.g. to make the images smaller, without
> the need to refetch all content)? Or is the content write-once
> during fetch-parse time?
Not directly. The parse job is a simple, straightforward write-once
operation. To reparse already parsed data you would have to
implement your own job, which, for example, takes the parse_data as
input directory, reparses the data to a temporary directory, and then
replaces the original parse_data with the new one.
Take a look at the org.apache.nutch.parse.ParseSegment class to see
how the parse job works. Also, take a look at the
org.apache.nutch.crawl.CrawlDb and
org.apache.nutch.crawl.CrawlDbMerger classes for ways to implement
the replacing of an existing directory.
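
The "write to a temporary directory, then replace the original" step might look roughly like the sketch below. It is only an illustration of the pattern: it uses java.nio.file on the local filesystem rather than the Hadoop FileSystem API that Nutch itself works with, and the names DirectorySwapSketch, install, parse_data.tmp and part-00000 are placeholders chosen for the example, not taken from the Nutch source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DirectorySwapSketch {

    // Replace 'current' with 'fresh'. The previous version is parked next to it
    // as '<name>.old' and only deleted once the new directory is in place.
    static void install(Path fresh, Path current) {
        Path old = current.resolveSibling(current.getFileName() + ".old");
        try {
            deleteRecursively(old);            // clear leftovers from an earlier run
            if (Files.exists(current)) {
                Files.move(current, old);      // park the existing data
            }
            Files.move(fresh, current);        // put the new data in place
            deleteRecursively(old);            // drop the parked copy
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Delete a directory tree, children before parents. Does nothing if it is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> toDelete;
        try (Stream<Path> paths = Files.walk(dir)) {
            toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
    }

    // Small round trip: build an old and a new parse_data, swap them, read the result.
    static String demo() {
        try {
            Path segment = Files.createTempDirectory("segment");
            Path current = Files.createDirectory(segment.resolve("parse_data"));
            Files.write(current.resolve("part-00000"), "old".getBytes(StandardCharsets.UTF_8));

            // Assumption: the reparse job has written its output here.
            Path fresh = Files.createDirectory(segment.resolve("parse_data.tmp"));
            Files.write(fresh.resolve("part-00000"), "new".getBytes(StandardCharsets.UTF_8));

            install(fresh, current);
            String result = new String(
                    Files.readAllBytes(current.resolve("part-00000")), StandardCharsets.UTF_8);
            deleteRecursively(segment);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Parking the existing directory before moving the new one into place means a failed reparse never leaves the segment without a parse_data directory. A real job would do the equivalent renames through whatever filesystem abstraction the segment lives on.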
Good luck!
--
Regards,
Eelco Lempsink