Posted to user@nutch.apache.org by Alexander Aristov <al...@gmail.com> on 2009/01/20 11:56:00 UTC

how to split a page into separate documents

Hi all

Can someone suggest how to write a plugin (or parser) that parses a page
and produces more than one document from it?

I have pages that are composed of sections, and I would like to make each
section a separate searchable document in Nutch. Parsing the page is not a
problem: I can write a special parser, and I know the structure of the
pages.

But parsers return only one document; that is the problem.

How can I change this behaviour?
-- 
Best Regards
Alexander Aristov

Re: how to split a page into separate documents

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On Tue, Jan 20, 2009 at 12:56 PM, Alexander Aristov
<al...@gmail.com> wrote:
> Hi all
>
> Can someone suggest how to write a plugin (or parser) that parses a page
> and produces more than one document from it?
>
> I have pages that are composed of sections, and I would like to make each
> section a separate searchable document in Nutch. Parsing the page is not a
> problem: I can write a special parser, and I know the structure of the
> pages.
>
> But parsers return only one document; that is the problem.
>
> How can I change this behaviour?

Actually, parsers in Nutch trunk can return more than one document. Take a
look at the feed plugin for an example.
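
To illustrate the idea outside of Nutch, here is a minimal standalone sketch
in plain Java. It is not the feed plugin's code and uses no Nutch classes;
SectionSplitter, the <h2> section boundary, and the "#section-N" URL scheme
are all assumptions made up for this example. It only shows how one page can
be split into several entries, each under its own key, which a real parser
would then hand back through the trunk parser API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Nutch. */
class SectionSplitter {

    // Assumption: every section of the page starts with an <h2> heading.
    private static final Pattern HEADING = Pattern.compile(
            "<h2[^>]*>(.*?)</h2>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Splits one page into per-section documents, in document order.
     * Each entry is keyed by a distinct URL (page URL + "#section-N") and
     * holds the heading text, a newline, and the section's plain text.
     * Any content before the first heading is ignored.
     */
    static Map<String, String> split(String pageUrl, String html) {
        Map<String, String> docs = new LinkedHashMap<>();
        Matcher m = HEADING.matcher(html);
        int index = 0;
        int bodyStart = -1;
        String title = null;

        while (m.find()) {
            if (title != null) {
                // Close the previous section at the start of this heading.
                String body = stripTags(html.substring(bodyStart, m.start()));
                docs.put(pageUrl + "#section-" + (++index), title + "\n" + body);
            }
            title = stripTags(m.group(1));
            bodyStart = m.end();
        }
        if (title != null) {
            // The last section runs to the end of the page.
            String body = stripTags(html.substring(bodyStart));
            docs.put(pageUrl + "#section-" + (++index), title + "\n" + body);
        }
        return docs;
    }

    // Crude tag removal, good enough for a sketch; not a real HTML parser.
    private static String stripTags(String s) {
        return s.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

Calling SectionSplitter.split(url, html) on a page with two <h2> sections
gives two entries, keyed url#section-1 and url#section-2. The important part
is that every section gets its own unique key, so each one can be stored and
searched as a separate document.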

-- 
Doğacan Güney