Posted to dev@lucene.apache.org by mark harwood <ma...@yahoo.co.uk> on 2010/05/07 18:25:28 UTC

Adding another dimension to Lucene searches

I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene.

The idea needs a little explanation so I've put some slides up here to kick things off:

http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

Cheers
Mark



      

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Adding another dimension to Lucene searches

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(10/05/08 1:25), mark harwood wrote:
> I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene.
>
> The idea needs a little explanation so I've put some slides up here to kick things off:
>
> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
>    

NestedDocumentQuery is very simple and cool.
I like it very much!
+1

Koji

-- 
http://www.rondhuit.com/en/




Re: Adding another dimension to Lucene searches

Posted by Earwin Burrfoot <ea...@gmail.com>.
I've used something very similar to fold matching documents by some
field value, like author_id.
I hit the very same issue of keeping all the parts in the same segment,
and solved it with composite documents that go through the whole
pipeline, plus flushing segments manually.

On Fri, May 7, 2010 at 20:25, mark harwood <ma...@yahoo.co.uk> wrote:
> I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene.
>
> The idea needs a little explanation so I've put some slides up here to kick things off:
>
> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
>
> Cheers
> Mark



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785



Re: Adding another dimension to Lucene searches

Posted by mark harwood <ma...@yahoo.co.uk>.
Having implemented this code on a few projects, I find that the major challenge shifts from the back end to the front end: how to get end users to articulate the questions Lucene can answer with this.
Certainly an interesting challenge but that's another topic...





----- Original Message ----
From: J. Delgado <jo...@gmail.com>
To: dev@lucene.apache.org
Sent: Mon, 10 May, 2010 16:47:50
Subject: Re: Adding another dimension to Lucene searches

Hierarchical documents are a key concept towards a unified
structured+unstructured search. It should allow us to fully implement
things such as XQuery + Full-Text
(http://www.w3.org/TR/xquery-full-text/)

Additionally, it solves a century-old problem: how to deal with
sections/sub-sections in very large documents. A long time ago I was
indexing text books (in PDF) and had to break down the book into pages
and store the main doc id in a field as pointer to maintain the
relation.

Mark, way to go!

-- Joaquin

On Mon, May 10, 2010 at 8:03 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Very cool stuff, Mark.
>
> Can you just open a JIRA and attach there?
>
> On May 10, 2010, at 8:38 AM, mark harwood wrote:
>
>> I've put up code, example data and tests for the Nested Document feature here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip
>>
>> The data used in the unit tests is chosen to illustrate practical use of real-world content.
>> The final unit tests will work on more abstract data for more formal/exhaustive testing of functionality.
>>
>> This packaging changes no existing Lucene code and is bundled with 3.0.1 but should work with 2.9.1. The readme.txt highlights the issues with segment flushing that may need addressing before adoption.
>>
>>
>> Cheers
>> Mark



Re: Adding another dimension to Lucene searches

Posted by "J. Delgado" <jo...@gmail.com>.
Hierarchical documents are a key concept towards a unified
structured+unstructured search. It should allow us to fully implement
things such as XQuery + Full-Text
(http://www.w3.org/TR/xquery-full-text/)

Additionally, it solves a century-old problem: how to deal with
sections/sub-sections in very large documents. A long time ago I was
indexing text books (in PDF) and had to break down the book into pages
and store the main doc id in a field as pointer to maintain the
relation.
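
For illustration only, the page-per-document workaround described above might look roughly like this. It is a toy sketch, not Lucene code: the field names book_id and page_no and the helper hits_by_book are hypothetical. Each page is a separate document carrying a pointer to its book, so hits have to be folded back to the book after the search.

```python
from collections import defaultdict


def hits_by_book(page_hits):
    """Group page-level hits by the book they point to.

    page_hits is a list of dicts, one per matching page document, each
    carrying a hypothetical "book_id" pointer field and a "page_no".
    """
    grouped = defaultdict(list)
    for page in page_hits:
        grouped[page["book_id"]].append(page["page_no"])
    return dict(grouped)
```

The grouping step runs outside the index, which is the extra bookkeeping that a nested-document layout is meant to make unnecessary.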

Mark, way to go!

-- Joaquin

On Mon, May 10, 2010 at 8:03 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Very cool stuff, Mark.
>
> Can you just open a JIRA and attach there?
>
> On May 10, 2010, at 8:38 AM, mark harwood wrote:
>
>> I've put up code, example data and tests for the Nested Document feature here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip
>>
>> The data used in the unit tests is chosen to illustrate practical use of real-world content.
>> The final unit tests will work on more abstract data for more formal/exhaustive testing of functionality.
>>
>> This packaging changes no existing Lucene code and is bundled with 3.0.1 but should work with 2.9.1. The readme.txt highlights the issues with segment flushing that may need addressing before adoption.
>>
>>
>> Cheers
>> Mark



Re: Adding another dimension to Lucene searches

Posted by Grant Ingersoll <gs...@apache.org>.
Very cool stuff, Mark.

Can you just open a JIRA and attach there?

On May 10, 2010, at 8:38 AM, mark harwood wrote:

> I've put up code, example data and tests for the Nested Document feature here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip
> 
> The data used in the unit tests is chosen to illustrate practical use of real-world content.
> The final unit tests will work on more abstract data for more formal/exhaustive testing of functionality.
> 
> This packaging changes no existing Lucene code and is bundled with 3.0.1 but should work with 2.9.1. The readme.txt highlights the issues with segment flushing that may need addressing before adoption.
> 
> 
> Cheers
> Mark





Re: Adding another dimension to Lucene searches

Posted by mark harwood <ma...@yahoo.co.uk>.
I've put up code, example data and tests for the Nested Document feature here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip

The data used in the unit tests is chosen to illustrate practical use of real-world content.
The final unit tests will work on more abstract data for more formal/exhaustive testing of functionality.

This packaging changes no existing Lucene code and is bundled with 3.0.1 but should work with 2.9.1. The readme.txt highlights the issues with segment flushing that may need addressing before adoption.


Cheers
Mark



      



Re: Adding another dimension to Lucene searches

Posted by Lance Norskog <go...@gmail.com>.
There are two separate problems that I know of in indexing parts of
PDFs in an overlapping way:

1) block-structured documents of
   a) the entire PDF file
   b) chapters
   c) sections of chapters
   d.....z)
2) Tracking the set of pages that each document contains.

As I understand this, LUCENE-2324 handles the first case but not the
second. True?

On Sat, May 8, 2010 at 10:37 AM, Michael Busch <bu...@gmail.com> wrote:
> On 5/8/10 3:10 AM, Mark Harwood wrote:
>>
>> The downside is the need to maintain sequences of related docs in the same
>> segment - something Lucene currently doesn't make easy with its limited
>> control over when segments are flushed. I suspect we'll need some discussion
>> on how best to support this.
>>
>
> LUCENE-2324 should help to make this work even when you add documents with
> multiple threads.  There will be one DocumentsWriter per thread (DWPT), and
> the different DWPTs will write to their own segments.  We will also have an
> extension point to control thread binding.  Then you can make sure that all
> parts of your compound document end up sequentially in the same segment.
>
> One thing we have to make sure though is that a DWPT doesn't flush "between"
> different parts of your compound doc.  Hmm, we might have to add a "flush
> policy" to our growing family of policies.
>
>  Michael



-- 
Lance Norskog
goksron@gmail.com



Re: Adding another dimension to Lucene searches

Posted by Michael Busch <bu...@gmail.com>.
On 5/8/10 3:10 AM, Mark Harwood wrote:
> The downside is the need to maintain sequences of related docs in the same segment - something Lucene currently doesn't make easy with its limited control over when segments are flushed. I suspect we'll need some discussion on how best to support this.
>    

LUCENE-2324 should help to make this work even when you add documents 
with multiple threads.  There will be one DocumentsWriter per thread 
(DWPT), and the different DWPTs will write to their own segments.  We 
will also have an extension point to control thread binding.  Then you 
can make sure that all parts of your compound document end up 
sequentially in the same segment.

One thing we have to make sure though is that a DWPT doesn't flush 
"between" different parts of your compound doc.  Hmm, we might have to 
add a "flush policy" to our growing family of policies.
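
As a rough illustration of what such a flush policy would have to guarantee, here is a toy model. It is not the DocumentsWriter API: the class BlockAwareBuffer and its methods are hypothetical. The buffer only considers flushing at the boundary between compound documents, never in the middle of one.

```python
class BlockAwareBuffer:
    """Toy in-memory buffer that never splits a compound document."""

    def __init__(self, max_buffered_docs):
        self.max_buffered_docs = max_buffered_docs
        self.pending = []   # docs buffered for the current segment
        self.segments = []  # flushed segments, each a list of docs

    def add_block(self, docs):
        # Flush *before* the block if adding it would cross the
        # threshold, so all parts of the block land in one segment.
        # A block larger than the threshold still goes in whole.
        if self.pending and len(self.pending) + len(docs) > self.max_buffered_docs:
            self.flush()
        self.pending.extend(docs)

    def flush(self):
        if self.pending:
            self.segments.append(self.pending)
            self.pending = []
```

With a threshold of 4, adding blocks of 3, 2 and 1 documents and then flushing yields two segments, and no block straddles the boundary.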

  Michael



Re: Adding another dimension to Lucene searches

Posted by Mark Harwood <ma...@yahoo.co.uk>.
OK, seems like there is some interest.
I'll work on packaging the code/unit tests/demos and make it available.


> matching ids ... but I didn't quite catch from the slides how you encode
> the parent-child link... is it just "the next docs are sub-documents
> until the next parent doc"? 

Yes - using physical proximity avoids any kind of costly look-ups and allows efficient streaming/skipTo logic to work as per usual.
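
A minimal sketch of that layout, assuming the parent is written first and its children follow immediately. This is a toy model with plain dicts standing in for documents; flatten, parent_of and the is_parent flag are hypothetical names, not part of the posted code.

```python
def flatten(records):
    """Lay out each (parent, children) record as one contiguous run."""
    docs = []
    for parent, children in records:
        docs.append(dict(parent, is_parent=True))
        for child in children:
            docs.append(dict(child, is_parent=False))
    return docs


def parent_of(docs, doc_id):
    """Walk back to the nearest preceding parent.

    No id field or join table is consulted; position alone encodes
    the relationship. Assumes docs[0] is a parent, as flatten ensures.
    """
    while not docs[doc_id]["is_parent"]:
        doc_id -= 1
    return doc_id
```

Because the relationship lives only in document order, anything that reorders or splits the run breaks it.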

The downside is the need to maintain sequences of related docs in the same segment - something Lucene currently doesn't make easy with its limited control over when segments are flushed. I suspect we'll need some discussion on how best to support this.

Another dependency is that Lucene maintains the sequencing of documents when merging segments together. I think we can rely on this currently (please correct me if I'm wrong), but I would like to formalise it with a JUnit test or some other form of commitment that guarantees this state of affairs.

Cheers
Mark


On 8 May 2010, at 08:32, Andrzej Bialecki wrote:

> On 2010-05-07 18:25, mark harwood wrote:
>> I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene.
>> 
>> The idea needs a little explanation so I've put some slides up here to kick things off:
>> 
>> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
> 
> Very cool stuff. If I understand the design correctly, the cost of the
> query is roughly the same as constructing a Filter Query from the parent
> query, and then executing the child query with this filter. You probably
> use childScorer.skipTo(nextParentId) to avoid actually traversing all
> matching ids ... but I didn't quite catch from the slides how you encode
> the parent-child link... is it just "the next docs are sub-documents
> until the next parent doc"? or is it a field in the children that points
> to a unique id field of the parent?
> 
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com




Re: Adding another dimension to Lucene searches

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-05-07 18:25, mark harwood wrote:
> I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene.
> 
> The idea needs a little explanation so I've put some slides up here to kick things off:
> 
> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

Very cool stuff. If I understand the design correctly, the cost of the
query is roughly the same as constructing a Filter Query from the parent
query, and then executing the child query with this filter. You probably
use childScorer.skipTo(nextParentId) to avoid actually traversing all
matching ids ... but I didn't quite catch from the slides how you encode
the parent-child link... is it just "the next docs are sub-documents
until the next parent doc"? or is it a field in the children that points
to a unique id field of the parent?
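
To make the skipTo idea concrete, here is a hedged sketch over sorted doc-id lists. It assumes the "next docs are sub-documents until the next parent doc" encoding and is not taken from the slides; matching_parents and its arguments are hypothetical names, and binary search stands in for a scorer's skipTo.

```python
from bisect import bisect_left, bisect_right


def matching_parents(child_hits, parent_ids):
    """Map sorted child-query hits to their parents.

    child_hits: sorted doc ids matching the child query.
    parent_ids: sorted doc ids of all parent documents.
    """
    result = []
    pos = 0
    while pos < len(child_hits):
        doc = child_hits[pos]
        p = bisect_right(parent_ids, doc)
        if p == 0:
            pos += 1  # hit precedes the first parent: no owner
            continue
        result.append(parent_ids[p - 1])
        if p == len(parent_ids):
            break  # last block: nothing further to skip to
        # The "skipTo" step: jump past the remaining hits in this
        # block straight to the first hit at or after the next parent.
        pos = bisect_left(child_hits, parent_ids[p], pos + 1)
    return result
```

Each parent is reported once no matter how many of its children match, and the other hits inside an already-matched block are never visited.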



-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Adding another dimension to Lucene searches

Posted by Ard Schrijvers <a....@onehippo.com>.
I think this is really interesting for Jackrabbit. I'd really like to
see it become part of the Lucene code base (though I am not sure
whether you were only polling Lucene devs...)

Regards Ard

On Fri, May 7, 2010 at 9:04 PM, Steven A Rowe <sa...@syr.edu> wrote:
> Hi Mark,
>
> This is extremely cool.  The user list regularly gets questions about modeling is-a relations, and as you outline in your presentation, there currently is no (performant) way to do it in the general case.
>
> Here's my (non-binding) +1 for inclusion in Lucene.
>
> Steve
>
> On 05/07/2010 at 12:25 PM, mark harwood wrote:
>> I have been working on a hierarchical search capability for a while now
>> and wanted to see if there was general interest in adopting some of the
>> thinking into Lucene.
>>
>> The idea needs a little explanation so I've put some slides up here to
>> kick things off:
>>
>> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
>>
>> Cheers
>> Mark



RE: Adding another dimension to Lucene searches

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mark,

This is extremely cool.  The user list regularly gets questions about modeling is-a relations, and as you outline in your presentation, there currently is no (performant) way to do it in the general case.

Here's my (non-binding) +1 for inclusion in Lucene.  

Steve

On 05/07/2010 at 12:25 PM, mark harwood wrote:
> I have been working on a hierarchical search capability for a while now
> and wanted to see if there was general interest in adopting some of the
> thinking into Lucene.
> 
> The idea needs a little explanation so I've put some slides up here to
> kick things off:
> 
> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
> 
> Cheers
> Mark


Re: Adding another dimension to Lucene searches

Posted by Chris Hostetter <ho...@fucit.org>.
: I have been working on a hierarchical search capability for a while now 
: and wanted to see if there was general interest in adopting some of the 
: thinking into Lucene.

This looks cool ... up to slide #5 i thought you were just 
proposing something akin to using FieldMaskingSpanQuery, but 
NestedDocumentQuery is an awesome idea ... slide 9 makes it seem so easy 
(I assume you build up the BitSet by scanning the TermDocs for some 
configurable marker term denoting a "parent" doc?)
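
The guess above could be sketched as follows. This is an assumption about the approach, not the posted implementation: a dict of term to sorted doc ids stands in for TermDocs, and the marker term "type:parent" and the function parent_bitset are hypothetical.

```python
def parent_bitset(postings, marker_term, max_doc):
    """Build one flag per doc id, set where the marker term occurs.

    postings: dict mapping a term to the sorted doc ids containing it.
    marker_term: the configurable term that tags parent documents.
    max_doc: number of documents in the (toy) index.
    """
    bits = bytearray(max_doc)
    for doc_id in postings.get(marker_term, ()):
        bits[doc_id] = 1
    return bits
```

The scan touches only the postings of the one marker term, so the set can be built once per reader and reused across queries.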

I haven't thought it through very well, but i suspect something like this
would eliminate a lot of the hairy use cases for "field collapsing" allowing
a more streamlined solution for that problem as well.



-Hoss

