Posted to user@nutch.apache.org by placoteco placoteco <pl...@gmail.com> on 2009/09/21 13:27:09 UTC

Split an input document to store different parts of it as independent Lucene documents.

Hi everybody:

I would like to ask for a way to split an input document identified by a
URL, for example HTML or XML, and store its different parts as independent
documents within the index. Assume that all the documents I have to crawl
have the same internal structure:
<html><body><paragraph>...</paragraph>...<paragraph>...</paragraph></body></html>.
I want to split that input and store every paragraph as an independent
document.

I'll try to explain it with an example. Suppose we have the link
http://my.server:myport/docA.html. We fetch it, and its content is:
<html><body><paragraph>first paragraph</paragraph><paragraph>second
paragraph</paragraph></body></html>. I want to split it and store two
documents: the first one will contain the first paragraph and the second one
will contain the second paragraph. The Lucene index would look something like
this:

    doc 1:
        -url: http://my.server:myport/docA.html
        -content: first paragraph
        -split: yes
        -split order: 1
        ...

    doc 2:
        -url: http://my.server:myport/docA.html
        -content: second paragraph
        -split: yes
        -split order: 2
        ...
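
To make the intent concrete, here is a minimal sketch in plain Python of the
splitting I have in mind. It is only an illustration of the desired result and
does not use any Nutch or Lucene API: the function name split_into_documents
and the regular expression are my own assumptions, and the field names simply
mirror the listing above.

```python
import re

# Hypothetical helper, not part of Nutch or Lucene. Assumes <paragraph>
# elements are not nested and carry no attributes.
PARAGRAPH_RE = re.compile(r"<paragraph>(.*?)</paragraph>", re.DOTALL)


def split_into_documents(url, html):
    """Return one record per <paragraph> element, in document order."""
    docs = []
    for order, match in enumerate(PARAGRAPH_RE.finditer(html), start=1):
        # Collapse whitespace so text wrapped across lines becomes one string.
        content = " ".join(match.group(1).split())
        docs.append({
            "url": url,            # every part keeps the URL of the source page
            "content": content,
            "split": "yes",
            "split order": order,  # 1-based position within the source page
        })
    return docs


if __name__ == "__main__":
    page = ("<html><body><paragraph>first paragraph</paragraph>"
            "<paragraph>second\nparagraph</paragraph></body></html>")
    for doc in split_into_documents("http://my.server:myport/docA.html", page):
        print(doc)
```

Each record would then have to become one independent document in the index,
which is the part I do not know how to do within Nutch.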

Thanks in advance.