Posted to user@nutch.apache.org by Anton Beza <an...@gmail.com> on 2007/07/26 16:16:06 UTC

Pull out a page from already processed pages, re-parse and replace

Hello,

I'm trying to find a way to re-parse the pages stored through Nutch.

I want to be able to access the pages Nutch has already processed and
stored, apply a new parser, and replace the old content with the new.

Is this possible in Nutch 0.8, or will it have to be altered to achieve
this?

Thanks,
Anton

Re: Pull out a page from already processed pages, re-parse and replace

Posted by Anton Beza <an...@gmail.com>.

Thanks!

I'd like to automate this.  Do you know which Java class does the actual
parsing?

Thanks again,
Anton

On 7/26/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> Just remove the following directories from each segment: crawl_parse,
> parse_text, parse_data, and then run bin/nutch parse on these segments.

Re: Pull out a page from already processed pages, re-parse and replace

Posted by Andrzej Bialecki <ab...@getopt.org>.

Just remove the following directories from each segment: crawl_parse, 
parse_text, parse_data, and then run bin/nutch parse on these segments.
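
For anyone who wants to script the two steps above, here is a minimal sketch
in Python. It assumes the segments sit on the local filesystem, one
subdirectory per segment (segments on a distributed filesystem would need
that filesystem's own tools instead). The function name reparse_segments and
its parameters are made up for illustration and are not part of Nutch. As
far as I know the bin/nutch script maps the "parse" command to
org.apache.nutch.parse.ParseSegment, but check the script in your own
release before relying on that.

```python
import shutil
import subprocess
from pathlib import Path

# Parse output directories named in the reply above.
PARSE_DIRS = ("crawl_parse", "parse_text", "parse_data")


def reparse_segments(segments_root, nutch_cmd="bin/nutch", run=subprocess.run):
    """Delete old parse output from each segment, then re-run the parser.

    Assumption: segments_root is a local directory holding one
    subdirectory per segment. `run` is injectable so the function can be
    exercised without a Nutch installation.
    """
    processed = []
    for segment in sorted(Path(segments_root).iterdir()):
        if not segment.is_dir():
            continue
        for name in PARSE_DIRS:
            # ignore_errors: a segment that was never parsed lacks these dirs
            shutil.rmtree(segment / name, ignore_errors=True)
        # Same as typing: bin/nutch parse <segment>
        run([nutch_cmd, "parse", str(segment)], check=True)
        processed.append(segment)
    return processed
```

A call such as reparse_segments("crawl/segments") would be run from the
Nutch install directory; the path is only an example. The fetched content
itself is left untouched, so only the parse output is regenerated.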

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com