Posted to java-user@lucene.apache.org by Daniel Moldovan <da...@level7.ro> on 2005/07/08 08:57:43 UTC

search that span over consecutive documents

Hello everyone,

 

My application must index a lot of books that are stored in xml files.

Each xml file represents a page of the book and this way each page becomes a
lucene Document.

Each page is organized in different sections and finally each section
contains lines.

 

What I need to do is let the user search for a phrase that starts at the
end of a page and continues on the next page. The span should have some
limit, let's say 6 words on each page.

 

Has anyone had experience with this kind of search? Please share your
knowledge if you have.

Thank you very much,

 

Daniel


Re: search that span over consecutive documents

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 8, 2005, at 2:57 AM, Daniel Moldovan wrote:
> My application must index a lot of books that are stored in xml files.
>
> Each xml file represents a page of the book and this way each page
> becomes a lucene Document.
>
> Each page is organized in different sections and finally each section
> contains lines.
>
>
>
> What I need to do is give the user the possibility to search for a
> phrase that starts at the end of a page and continues on the next
> page. The span should have some limits, let's say, 6 words on each
> page.
>
> Has anyone had experience with this kind of search? Please share your
> knowledge if you have.

You're lucky you get to represent your data so hierarchically!  Try  
getting scholars to represent a book in such a fashion!!!  (I'm  
dealing with scholarly works in XML format and sections do not fall  
_within_ pages, they can span across pages).

In this case, one field of your document should probably index a page  
+ 6 words on either side of it from the previous and next pages.   
Maybe you also have a field that represents only the page as well.   
Perhaps something at query time decides which field to search?  Maybe  
all phrase queries use the overlapped field and other query types use  
the single page field?

     Erik
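
For illustration, here is a minimal sketch of how the text for the
overlapped field suggested above might be built. It is plain Java with no
Lucene calls. The class name PageOverlap, the method overlappedText, and
the whitespace-based word splitting are assumptions made for this example
and are not from the thread; a real indexer would count words the way its
Analyzer tokenizes them and would then add the resulting strings to each
page's Document as two separate fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PageOverlap {

    // Hypothetical helper: split on whitespace. This stands in for whatever
    // tokenization the real analyzer performs.
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Text for the overlapped field of page `index`: up to `window` trailing
    // words of the previous page, the whole page, and up to `window` leading
    // words of the next page.
    static String overlappedText(List<String> pages, int index, int window) {
        List<String> parts = new ArrayList<>();
        if (index > 0) {
            List<String> prev = words(pages.get(index - 1));
            parts.addAll(prev.subList(Math.max(0, prev.size() - window), prev.size()));
        }
        parts.addAll(words(pages.get(index)));
        if (index < pages.size() - 1) {
            List<String> next = words(pages.get(index + 1));
            parts.addAll(next.subList(0, Math.min(window, next.size())));
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
            "the quick brown fox jumps over the lazy",
            "dog and then runs far away from here",
            "before the sun sets");
        for (int i = 0; i < pages.size(); i++) {
            // One string per field: the page alone, and the page with overlap.
            System.out.println("page " + i + " single:     " + pages.get(i));
            System.out.println("page " + i + " overlapped: " + overlappedText(pages, i, 6));
        }
    }
}
```

One consequence of this layout: a phrase that crosses a page boundary
appears in the overlapped field of both neighbouring pages, so a phrase
query against that field will return both pages as hits. A phrase that
lies entirely within the first or last 6 words of a page will also match
the neighbouring page's overlapped field. The application may want to
collapse or filter such hits at display time.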