
Posted to dev@lucene.apache.org by "Mark Harwood (JIRA)" <ji...@apache.org> on 2010/05/10 17:30:17 UTC

[jira] Created: (LUCENE-2454) Nested Document query support

Nested Document query support
-----------------------------

                 Key: LUCENE-2454
                 URL: https://issues.apache.org/jira/browse/LUCENE-2454
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Search
    Affects Versions: 3.0.2
            Reporter: Mark Harwood
            Assignee: Mark Harwood
            Priority: Minor


A facility for querying nested documents in a Lucene index as outlined in http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878617#action_12878617 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

Yep, I can see that an app with a thousand cached filters would have a problem with this implementation as it stands.

Maintaining parallel indexes always feels a little flaky to me, not least because you lose the transactional integrity you get from using a single index.

Would another approach be to make your cached filters document-type-specific? I.e. they only hold numbers in the range of zero to number-of-docs-of-this-type.
To use a cached doc ID in such a filter you would need mapping arrays to project the type-specific doc ID numbers into global doc ID references and back.
Let's imagine an index with a mix of "A", "B" and "C" doc types organised as follows (doc IDs are zero-based):
docId    docType
=====    =======
0            A
1            B
2            C
3            A
4            C
5            C

The mapping arrays for docType "C" would look as follows:
{code:title=Bar.java|borderStyle=solid}
int[] globalDocIdToTypeCLookUp = {-1, -1, 0, -1, 1, 2};  // sparse, one entry per doc in the overall index; -1 = not a type C doc
int[] typeCToGlobalDocIdLookUp = {2, 4, 5};              // dense, one entry per type C doc in the overall index
{code}

Your cached filters would be created as follows:
{code:title=Bar.java|borderStyle=solid}
OpenBitSet myTypeCBitSet = new OpenBitSet(numberOfTypeCDocs);  // this line is hopefully where you save RAM!
// for all matching type C docs...
myTypeCBitSet.set(globalDocIdToTypeCLookUp[realDocId]);
{code}

Your filters can then be used by dereferencing the child doc IDs as follows:
{code:title=Bar.java|borderStyle=solid}
int typeCDocId = myTypeCBitSet.nextSetBit(0);  // returns -1 when no bit is set, so check before dereferencing
int nextRealDocId = typeCToGlobalDocIdLookUp[typeCDocId];
{code}
  
Clearly the mapping arrays come at a cost of 4 bytes * num docs, which is non-trivial. The sparse globalDocIdToTypeCLookUp array shown here could be avoided by reading TermDocs and counting at cached-filter-creation time.
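
As a rough, self-contained sketch of the mapping-array idea above: the code below uses java.util.BitSet in place of Lucene's OpenBitSet so that it runs without Lucene on the classpath, and the class and method names (TypeSpecificFilterSketch, TypeMapping, buildFilter, toGlobalIds) are made up for illustration. It assumes zero-based global doc IDs and a per-document type label held in a plain array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class TypeSpecificFilterSketch {

    /** Mapping arrays for one doc type (hypothetical helper, not a Lucene class). */
    static final class TypeMapping {
        final int[] globalToType; // sparse: -1 where the doc is of another type
        final int[] typeToGlobal; // dense: one entry per doc of this type

        TypeMapping(String[] docTypes, String type) {
            globalToType = new int[docTypes.length];
            Arrays.fill(globalToType, -1);
            int[] dense = new int[docTypes.length];
            int count = 0;
            for (int globalId = 0; globalId < docTypes.length; globalId++) {
                if (type.equals(docTypes[globalId])) {
                    globalToType[globalId] = count;
                    dense[count++] = globalId;
                }
            }
            typeToGlobal = Arrays.copyOf(dense, count);
        }
    }

    /** Builds a filter sized to the number of docs of this type only. */
    static BitSet buildFilter(TypeMapping m, int[] matchingGlobalIds) {
        BitSet bits = new BitSet(m.typeToGlobal.length);
        for (int globalId : matchingGlobalIds) {
            int typeId = m.globalToType[globalId];
            if (typeId >= 0) {
                bits.set(typeId); // ignore docs that are not of this type
            }
        }
        return bits;
    }

    /** Projects the set bits back into global doc IDs. */
    static List<Integer> toGlobalIds(TypeMapping m, BitSet bits) {
        List<Integer> out = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out.add(m.typeToGlobal[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docTypes = {"A", "B", "C", "A", "C", "C"};
        TypeMapping c = new TypeMapping(docTypes, "C");
        BitSet filter = buildFilter(c, new int[] {2, 5});
        System.out.println(toGlobalIds(c, filter));
    }
}
```

The filter itself only needs one bit per type C doc; the price is the two int arrays, which are shared by every cached filter for that type.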



[jira] Updated: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-2454:
---------------------------------

    Attachment: LuceneNestedDocumentSupport-1.zip

Initial attachment is code plus illustrative data/tests.
Fuller unit tests, build scripts etc. to follow.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866148#action_12866148 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

bq. - there was a discussion on narrowing indexing API to something stream-like

Any idea where that discussion was taking place? Happy to move flush-control discussions elsewhere if that is more appropriate.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882899#action_12882899 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

bq. Can this help in searching over multiple child/nested documents?

Yes, a typical use case is to use "NestedDocumentQuery" to fetch the top 10 parents, then do a second query to fetch the children using a mandatory clause which lists the primary keys of the selected parents (assuming the children have an indexed field with the parent primary key).
The "PerParentLimitedQuery" can be used to limit the number of child docs returned per parent if there are many, e.g. pages in a book. Both these classes are in the zipped attachment to this issue.
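
To illustrate the intent of the second step (not how "PerParentLimitedQuery" is actually implemented, which lives in the attachment), here is a minimal in-memory sketch. It assumes the child hits have already been fetched and carry their parent's primary key; ChildHit, limitPerParent and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PerParentLimitSketch {

    /** A child match plus the primary key of its parent (hypothetical type). */
    static final class ChildHit {
        final String parentKey;
        final int docId;
        final float score;

        ChildHit(String parentKey, int docId, float score) {
            this.parentKey = parentKey;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Keeps at most maxPerParent of the best-scoring child hits for each selected parent. */
    static Map<String, List<ChildHit>> limitPerParent(
            List<ChildHit> hits, Set<String> selectedParents, int maxPerParent) {
        Map<String, List<ChildHit>> byParent = new LinkedHashMap<>();
        for (ChildHit hit : hits) {
            // Equivalent of the mandatory clause listing the selected parents' keys.
            if (selectedParents.contains(hit.parentKey)) {
                byParent.computeIfAbsent(hit.parentKey, k -> new ArrayList<>()).add(hit);
            }
        }
        for (Map.Entry<String, List<ChildHit>> e : byParent.entrySet()) {
            List<ChildHit> children = e.getValue();
            children.sort((a, b) -> Float.compare(b.score, a.score)); // best first
            if (children.size() > maxPerParent) {
                e.setValue(new ArrayList<>(children.subList(0, maxPerParent)));
            }
        }
        return byParent;
    }

    public static void main(String[] args) {
        List<ChildHit> hits = Arrays.asList(
                new ChildHit("book1", 10, 0.2f),
                new ChildHit("book1", 11, 0.9f),
                new ChildHit("book1", 12, 0.5f),
                new ChildHit("book2", 20, 0.7f));
        Set<String> parents = new HashSet<>(Arrays.asList("book1"));
        for (ChildHit h : limitPerParent(hits, parents, 2).get("book1")) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```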


[jira] Updated: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-2454:
---------------------------------

    Attachment:     (was: LuceneNestedDocumentSupport-1.zip)


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878434#action_12878434 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

bq. Wow, this is absolutely awesome! 

Thanks. I've found that this certainly solves problems I previously couldn't address at all in standard Lucene.

bq. The leading concern I have with this implementation is the size of the number of documents in the index as it affects the size of filters

These filters can obviously be cached, but you'll need one filter per level you "roll up" to. Assuming a 300m doc index and only rolling up matches to the root, that should only cost 300m docs / 8 bits per byte = 37.5 MB of RAM. Index reloads should avoid the cost of completely rebuilding this filter nowadays because filters are cached at segment level and unchanged segments will retain their cached filters.
Perhaps a bigger concern is any norms arrays, which are allocated one BYTE (as opposed to one bit) per document in the index.
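
The arithmetic behind those figures can be checked with a few lines of Java. This is only a back-of-envelope sketch: it counts the raw bits and bytes and ignores object headers and any padding a real bitset adds; the class and method names are made up.

```java
class FilterRamEstimate {

    /** Bytes for a bitset filter holding one bit per document. */
    static long bitsetBytes(long numDocs) {
        return (numDocs + 7) / 8; // round up to whole bytes
    }

    /** Bytes for a single norms array holding one byte per document. */
    static long normsBytes(long numDocs) {
        return numDocs;
    }

    public static void main(String[] args) {
        long docs = 300_000_000L;
        System.out.printf("filter: %.1f MB (%.2f MiB)%n",
                bitsetBytes(docs) / 1e6, bitsetBytes(docs) / 1048576.0);
        System.out.printf("one norms array: %.1f MB%n", normsBytes(docs) / 1e6);
    }
}
```

For 300m docs that is 37,500,000 bytes per filter (37.5 MB in decimal units, a little under 36 MiB) against 300,000,000 bytes for each norms array.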

bq. and they don't share any fields with the parent. 

For parents with only 1 child document instance of a given type, these could be safely "rolled up" into the parent and stored in the same document.




[jira] Updated: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-2454:
---------------------------------

    Attachment:     (was: TestNestedDocumentQueryWithMultiSegments.java)


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888908#action_12888908 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

The 2nd comment above talks about this and the need for Lucene to offer more control over flush policy.

bq. it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index

But aren't filters held and evaluated within the context of each sub-reader? Are you sure the issue isn't limited to a parent/child combo that is split across segments?


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Amit Kulkarni (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882163#action_12882163 ] 

Amit Kulkarni commented on LUCENE-2454:
---------------------------------------

This is an amazing feature!

Can this help in searching over multiple child/nested documents? Is there sample code available that demonstrates how to achieve this?

We have a requirement wherein search results need to carry fields from child documents. Can this be achieved?


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878741#action_12878741 ] 

David Smiley commented on LUCENE-2454:
--------------------------------------

That's an interesting strategy.  The size of these arrays is no big deal to me since there are only a couple of them.  My concern is that potentially many places in Solr would have to become aware of this scheme, which might make the strategy untenable to implement even though it's theoretically sound.
  
Another nice thing about the parallel index is that the idf relevancy factor stays clean since it will only consider "real" documents.

I want to investigate these options closer ASAP since this feature you've implemented is something I need.  Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889217#action_12889217 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

bq. I made a minor modification to your approach by making it do a "Forward-scan" instead of reverse scan

Interesting, but I'm not sure what guarantees Lucene will make about:

* Sequencing of calls on scorer.nextDoc (i.e. are calls to all scorers involved guaranteed to be in doc-insert order?)
* Index-time merging of segments (i.e. are all segments merged together in an order that keeps the parent doc in one segment next to the child doc from the next segment?)

That seems like a fragile set of dependencies. 
Also, don't things get tricky when reporting matches from NestedDocumentQuery and PerParentLimitedQuery back to the collector? During the query process the IndexSearcher resets the docId context (Collector.setNextReader) as it moves from one Scorer/segment to another. If we are delaying the assessment/reporting of matches until we've crossed a segment boundary it is too late to report a match on a child doc id from a previous segment as the collector has already changed context. Unfortunately "PerParentLimitedQuery" needs to do this when selecting the "best" children for a single parent.




[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866133#action_12866133 ] 

Earwin Burrfoot commented on LUCENE-2454:
-----------------------------------------

An alternate approach - there was a discussion on narrowing indexing API to something stream-like, whereas Document becomes its default implementation. We can add some flush-boundary signalling methods, or a notion of composite documents to that new API.

I like this approach more, as control does not spread out across different APIs/instances. You don't have to hold reference to your policy in the indexing code, you don't have to raise/lower flags in some remote code to signal things that are internal to yours.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878317#action_12878317 ] 

David Smiley commented on LUCENE-2454:
--------------------------------------

Wow, this is absolutely awesome!  This is one of the best enhancement requests to Lucene/Solr that I've seen, as it brings a real enhancement that is difficult or impossible to do without.

The leading concern I have with this implementation is the number of documents in the index, as it affects the size of filters and perhaps other areas that involve creating BitSets.  I have a scenario in which the sub-documents number on average over 100 to each primary document.  These sub-documents are at least very small, and they don't share any fields with the parent.  For a large-scale search situation, an index containing 3M Lucene documents now needs to store over 300M, and thus requires 100x the amount of RAM for filter caches as I require now.  Thoughts?


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866128#action_12866128 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

Robust use of this feature is dependent on careful management of segments i.e. that all compound documents are held in the same segment.

Michael Busch suggested the introduction of a new "FlushPolicy" on IndexWriter to offer the required control. (see http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3C4BE5A14C.6040108@gmail.com%3E )
Sounds sensible to me given that IndexWriter currently manages to muddle 2 alternative policies in the one implementation and it looks like we now need a third.

Is this the place to start the debate on "FlushPolicy" ?
My guess is this change would involve :
* Deprecating/removing IndexWriter's setMaxBufferedDocs and setRAMBufferSizeMB.
* Providing a new "FlushPolicy" abstract class that is called with a "BufferContext" class to hold the number of buffered docs + RAM usage. FlushPolicy is asked if flushing of various structures should be triggered given the context
* Provide default implementations of FlushPolicy that are number-of-documents-based and RAM-based.
* Provide a special "NestedDocumentFlushPolicy" that can wrap any other policy (ram/num docs) but only triggers flushes when application code has primed it to say a batch of related documents is completed.
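
A minimal sketch of how these pieces might fit together is shown below. None of it exists in IndexWriter today: the class names come from the list above, while the method names (shouldFlush, beginBatch, endBatch), the fields of BufferContext and the DocCountFlushPolicy/RamFlushPolicy names are assumptions made purely for illustration.

```java
class FlushPolicySketch {
    public static void main(String[] args) {
        NestedDocumentFlushPolicy policy =
                new NestedDocumentFlushPolicy(new DocCountFlushPolicy(1000));
        policy.beginBatch(); // parent + children are being added
        System.out.println(policy.shouldFlush(new BufferContext(1500, 12.0)));
        policy.endBatch();   // compound document is complete
        System.out.println(policy.shouldFlush(new BufferContext(1500, 12.0)));
    }
}

/** Snapshot of buffer state handed to the policy (hypothetical shape). */
final class BufferContext {
    final int numBufferedDocs;
    final double ramUsedMB;

    BufferContext(int numBufferedDocs, double ramUsedMB) {
        this.numBufferedDocs = numBufferedDocs;
        this.ramUsedMB = ramUsedMB;
    }
}

/** Asked whether a flush should be triggered given the current context. */
abstract class FlushPolicy {
    abstract boolean shouldFlush(BufferContext context);
}

/** Default policy based on the number of buffered documents. */
final class DocCountFlushPolicy extends FlushPolicy {
    private final int maxBufferedDocs;

    DocCountFlushPolicy(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    @Override
    boolean shouldFlush(BufferContext context) {
        return context.numBufferedDocs >= maxBufferedDocs;
    }
}

/** Default policy based on RAM usage. */
final class RamFlushPolicy extends FlushPolicy {
    private final double maxRamMB;

    RamFlushPolicy(double maxRamMB) {
        this.maxRamMB = maxRamMB;
    }

    @Override
    boolean shouldFlush(BufferContext context) {
        return context.ramUsedMB >= maxRamMB;
    }
}

/** Wraps another policy but only allows a flush between batches of related docs. */
final class NestedDocumentFlushPolicy extends FlushPolicy {
    private final FlushPolicy delegate;
    private boolean batchComplete = true;

    NestedDocumentFlushPolicy(FlushPolicy delegate) {
        this.delegate = delegate;
    }

    void beginBatch() { batchComplete = false; }

    void endBatch() { batchComplete = true; }

    @Override
    boolean shouldFlush(BufferContext context) {
        return batchComplete && delegate.shouldFlush(context);
    }
}
```

The wrapper keeps the "is the buffer full?" decision in the delegate and adds only the "is it safe to flush now?" gate, so either default policy can be combined with nested documents.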

Let me know where it's best to continue the thinking on these IndexWriter changes.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Buddika Gajapala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889215#action_12889215 ] 

Buddika Gajapala commented on LUCENE-2454:
------------------------------------------

Mark, that was fast :)

BTW, another scenario: when there is a lot of data, there is a possibility of the parent document and a matching child document ending up in two different segments, causing some matches to be missed. I made a minor modification to your approach by making it do a "Forward-scan" instead of a reverse scan for parent docs, with the parent document inserted AFTER the child docs. In case the parent doc is located outside the scope of the current segment, its docid is preserved at the "Weight Object" level, and nextDoc() is modified to check for that on the very 1st nextDoc call into the new segment.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878581#action_12878581 ] 

David Smiley commented on LUCENE-2454:
--------------------------------------

35.7MB of RAM for every filter is a LOT compared to the 357KB I need now (100x).  Presumably the facet intersections now take 100x as long too.  I cache nearly a thousand of these per index (lots of faceting!) which is by the way just one Solr shard of many.  No can do.  :-(

I wonder if it's plausible to consider a different implementation strategy employing a parallel index, with the child documents storing the document IDs of the parent index.  I might even assume I need no more than 1000 child documents per parent and thus index blank documents as filler, so that if I am looking at a child document with id 32005 then it is the 6th sub-entity belonging to parent document id 32.  I know that document IDs are a bit transient, so some care would be needed to maintain this strategy.  Thoughts?
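
The ID arithmetic I have in mind is roughly the following. It is a sketch only: the class and method names are made up, it assumes a fixed 1000 slots per parent with zero-based slot numbers, and it says nothing about keeping the IDs stable across merges.

```java
class ParallelIndexIds {

    /** Fixed number of child slots reserved per parent, blank filler included. */
    static final int SLOTS_PER_PARENT = 1000;

    /** Child doc ID in the parallel index for a given parent and zero-based slot. */
    static int childDocId(int parentDocId, int slot) {
        if (slot < 0 || slot >= SLOTS_PER_PARENT) {
            throw new IllegalArgumentException("slot out of range: " + slot);
        }
        // multiplyExact/addExact throw instead of silently overflowing int.
        return Math.addExact(Math.multiplyExact(parentDocId, SLOTS_PER_PARENT), slot);
    }

    /** Parent doc ID that a child doc ID belongs to. */
    static int parentDocId(int childDocId) {
        return childDocId / SLOTS_PER_PARENT;
    }

    /** Zero-based slot of the child within its parent. */
    static int slot(int childDocId) {
        return childDocId % SLOTS_PER_PARENT;
    }

    public static void main(String[] args) {
        int child = 32005;
        System.out.println("parent=" + parentDocId(child) + " slot=" + slot(child));
    }
}
```

With int doc IDs this caps the parent index at a little over 2 million documents, which is another cost of reserving filler slots.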


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889104#action_12889104 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

bq. Maybe we should add an addDocuments call to IW? To add more than one document, "atomically", so that any flush must happen before or after them? 

That would be nice. 
Another way of modelling this would be to introduce Document.add(Document childDoc) but I think that is a more fundamental and wide-reaching change.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878782#action_12878782 ] 

Mark Harwood commented on LUCENE-2454:
--------------------------------------

bq.  I wonder if potentially many places in Solr would have to become aware of this scheme 

Supporting multiple doc types in this way can be a big modelling change. Not everyone needs it so I suspect that may be a hard sell to the Solr crowd.

bq. Another nice thing about the parallel index is that the idf relevancy factor stays clean

I have an "IDF compensating" wrapper query that takes care of that. It wraps a child query, wraps the Similarity class in use, and adjusts the IDF calculations so that they are based on the number of documents of the required type rather than on all documents in the index. I'll attach it here when I get a chance.
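
To make the effect concrete, here is a minimal, Lucene-free sketch of why the document count fed into IDF matters. It is not the wrapper query mentioned above (which is not shown in this thread); it only assumes the classic formula log(numDocs / (docFreq + 1)) + 1, and the class name and all figures are made up for illustration.

```java
// Hypothetical, Lucene-free sketch of type-scoped IDF; not the wrapper query itself.
class TypeScopedIdf {
    // Classic IDF shape: log(numDocs / (docFreq + 1)) + 1.
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        long docFreq = 50;          // child docs containing the term (made-up figure)
        long allDocs = 1_010_000;   // parents plus children in one index (made-up figure)
        long childDocs = 10_000;    // documents of the required type only (made-up figure)

        // Counting every document in the index inflates the term's apparent rarity.
        System.out.println("idf over whole index: " + idf(docFreq, allDocs));
        // Counting only documents of the required type gives the "clean" value.
        System.out.println("idf over child docs:  " + idf(docFreq, childDocs));
    }
}
```

The same term scores as rarer when unrelated document types are counted, which is the distortion a parallel index avoids and a compensating wrapper would have to correct.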

bq. Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.

I went through a similar thought process around using position info to make this stuff work. This child-doc approach used here seems the cleanest by far.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866134#action_12866134 ] 

Earwin Burrfoot commented on LUCENE-2454:
-----------------------------------------

Both things can be combined for sure. New stream-like indexing API stuffs docs into IW and controls when flushes /can/ happen, while FlushPolicy decides if they actually /do/ happen, when they /can/.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Buddika Gajapala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888896#action_12888896 ] 

Buddika Gajapala commented on LUCENE-2454:
------------------------------------------

I tried this solution and it works perfectly for smaller indexes (either fewer Documents or smaller Documents). However, for larger indexes that span multiple segments it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index, while the actual traversal for matching criteria done by the Scorer is segment-by-segment (i.e. in the nextDoc() and advance() methods).  Have you considered this situation?
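
One plausible way this symptom could arise is sketched below: an index-wide bit set is probed with a segment-relative doc ID, which is only correct when the segment's offset is zero. This is an illustration, not the code from the attached package; it assumes each parent is indexed immediately before its children, and `SegmentDocIds`, `docBase` and both method names are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch of the global-vs-segment doc ID mismatch; not the patch's code.
class SegmentDocIds {
    // Buggy lookup: treats a segment-relative ID as if it were index-wide.
    static int buggyParentOf(BitSet globalParents, int segmentDocId) {
        return globalParents.previousSetBit(segmentDocId);
    }

    // Corrected lookup: shift by the segment's starting offset before probing.
    // Returns the index-wide ID of the nearest parent at or before the doc, or -1.
    static int parentOf(BitSet globalParents, int docBase, int segmentDocId) {
        return globalParents.previousSetBit(docBase + segmentDocId);
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(0);  // segment 0 holds docs 0..4, parents at 0 and 3
        parents.set(3);
        parents.set(5);  // segment 1 starts at docBase 5, parents at 5 and 8
        parents.set(8);

        int docBase = 5;
        int childInSegment1 = 2;  // index-wide doc 7, a child of parent 5
        System.out.println("buggy:     " + buggyParentOf(parents, childInSegment1));
        System.out.println("corrected: " + parentOf(parents, docBase, childInSegment1));
    }
}
```

In the first segment the two lookups agree because the offset is zero, which would explain why single-segment tests pass.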


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868268#action_12868268 ] 

Earwin Burrfoot commented on LUCENE-2454:
-----------------------------------------

I thiiiiink, here - LUCENE-2309


[jira] Updated: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-2454:
---------------------------------

    Attachment: LuceneNestedDocumentSupport.zip

Updated package with a fix for the multi-segment issue and an improved JUnit test.


[jira] Commented: (LUCENE-2454) Nested Document query support

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889088#action_12889088 ] 

Michael McCandless commented on LUCENE-2454:
--------------------------------------------

Maybe we should add an addDocuments call to IW?  To add more than one document, "atomically", so that any flush must happen before or after them?
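
The idea can be illustrated with a toy buffer that checks its flush threshold only after a whole block has been added. This is a sketch of the proposed behaviour, not IndexWriter code; `ToyBlockWriter`, its threshold-based flushing and the use of strings as documents are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy writer illustrating block-atomic adds; not IndexWriter.
class ToyBlockWriter {
    private final int flushThreshold;                               // buffered docs that trigger a flush
    private final List<String> buffer = new ArrayList<>();          // docs not yet flushed
    private final List<List<String>> segments = new ArrayList<>();  // flushed segments, in order

    ToyBlockWriter(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Adds a single document; a flush may follow immediately.
    void addDocument(String doc) {
        buffer.add(doc);
        maybeFlush();
    }

    // Adds a block of documents; the flush check runs only after the whole block,
    // so a parent and its children always end up in the same segment.
    void addDocuments(List<String> block) {
        buffer.addAll(block);
        maybeFlush();
    }

    private void maybeFlush() {
        if (buffer.size() >= flushThreshold) {
            segments.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    List<List<String>> segments() {
        return segments;
    }
}
```

Adding a parent and its children one at a time through `addDocument` can split them across two segments when the threshold is hit mid-block, whereas `addDocuments` keeps the block contiguous, which is what a segment-by-segment parent/child scorer relies on.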


[jira] Updated: (LUCENE-2454) Nested Document query support

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-2454:
---------------------------------

    Attachment: TestNestedDocumentQueryWithMultiSegments.java

The attached JUnit test confirms the issue with multiple segments (thanks, Buddika).
Previous tests masked the error.
I'm looking into a fix now.

Cheers,
Mark
