Posted to user@nutch.apache.org by Anton Beza <an...@gmail.com> on 2007/07/26 16:16:06 UTC
Pull out a page from already processed pages, re-parse and replace
Hello,
I'm trying to find a way to re-parse the pages stored through Nutch.
I want to be able to access the pages Nutch has already processed and
stored, apply a new parser, and replace the old content with the new.
Is this possible in Nutch 0.8, or will it have to be altered to achieve
this?
Thanks,
Anton
Re: Pull out a page from already processed pages, re-parse and replace
Posted by Anton Beza <an...@gmail.com>.
Thanks!
I'd like to automate this. Do you know which Java class does the actual
parsing?
Thanks again,
Anton
On 7/26/07, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Anton Beza wrote:
> > Hello,
> >
> > I'm trying to find a way to re-parse the pages stored through Nutch.
> >
> > I want to be able to access the pages Nutch has already processed and
> > stored, apply a new parser, and replace the old content with the new.
> >
> > Is this possible in Nutch 0.8, or will it have to be altered to achieve
> > this?
>
> Just remove the following directories from each segment: crawl_parse,
> parse_text, parse_data, and then run bin/nutch parse on these segments.
Re: Pull out a page from already processed pages, re-parse and replace
Posted by Andrzej Bialecki <ab...@getopt.org>.
Anton Beza wrote:
> Hello,
>
> I'm trying to find a way to re-parse the pages stored through Nutch.
>
> I want to be able to access the pages Nutch has already processed and
> stored, apply a new parser, and replace the old content with the new.
>
> Is this possible in Nutch 0.8, or will it have to be altered to achieve
> this?
Just remove the following directories from each segment: crawl_parse,
parse_text, parse_data, and then run bin/nutch parse on these segments.
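For automating this, the two steps above could be scripted roughly as follows. This is only a sketch under assumptions that are not from the thread: the segments live on the local filesystem (not in a distributed filesystem), each subdirectory of the segments directory is one segment, and the command is invoked as "bin/nutch parse <segment>". The function names clear_parse_output and reparse_segments are made up for illustration. By default the parse commands are only built and returned, so the deletion can be checked before anything is actually run.

```python
import shutil
import subprocess
from pathlib import Path

# Parse output directories named in the reply above.
PARSE_DIRS = ("crawl_parse", "parse_text", "parse_data")


def clear_parse_output(segment):
    """Delete the parse output directories of one segment.

    Returns the names of the directories that were actually removed.
    """
    removed = []
    for name in PARSE_DIRS:
        target = Path(segment) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


def reparse_segments(segments_root, nutch_cmd="bin/nutch", run=False):
    """Clear parse output in every segment and build the parse commands.

    Assumption: every subdirectory of segments_root is a segment.
    With run=False (the default) the commands are returned but not executed.
    """
    commands = []
    root = Path(segments_root)
    for segment in sorted(p for p in root.iterdir() if p.is_dir()):
        clear_parse_output(segment)
        cmd = [nutch_cmd, "parse", str(segment)]
        if run:
            # Raises CalledProcessError if the parse job exits non-zero.
            subprocess.run(cmd, check=True)
        commands.append(cmd)
    return commands
```

Calling reparse_segments("crawl/segments", run=True) from the Nutch installation directory would then delete the old parse output and re-parse each segment in turn; the path "crawl/segments" is just an example. Directories such as content and crawl_fetch are left untouched, since the parser reads the fetched content from them.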
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com