You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Tim Smith (JIRA)" <ji...@apache.org> on 2009/08/18 23:00:28 UTC

[jira] Created: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Weight.scorer() not passed doc offset for "sub reader"
------------------------------------------------------

                 Key: LUCENE-1821
                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
             Project: Lucene - Java
          Issue Type: Bug
          Components: Search
    Affects Versions: 2.9
            Reporter: Tim Smith


Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)

If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid

suggest having Weight.scorer() method also take a integer for the doc offset

Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
All Weights that have "sub" weights must pass this offset down to created "sub" weights





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746600#action_12746600 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

well, you could go the route similar to the 2.4 TokenStream api (next() vs next(Token))

have Filter.getDocIdSet(IndexSearcher, IndexReader) call Filter.getDocIdSet(IndexReader), and vice versa by default
one method or the other would be required to be overridden

getDocIdSet(IndexReader) would be deprecated (and removed in 3.0)

Since the deprecated method would be removed in 3.0, and since noone would probably be depending on these new semantics right away this should work

Also, in general, QueryWrapperFilter performs a bit worse now in 2.9
this is because it creates an IndexSearcher for every query it wraps (which results in doing "gatherSubReaders" and creating the offsets anew each time getDocIdSet(IndexReader) is called
so, the new method with the IndexSearcher also passed in is much better for evaluating these Filters


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745756#action_12745756 ] 

Mark Miller edited comment on LUCENE-1821 at 8/20/09 6:12 PM:
--------------------------------------------------------------

{quote}
howevever, this method is actually also a bit broken now with per segment searching
what if reader is a MultiReader
{quote}

Right - there are many places where this could be the case - your still free to use multi-readers, though we encourage you to switch. We provide a cool cache sanity checker to help you find these cases, and evaluate whether or not you can make the switch. *edit* I know this doesn't help with filters - there was an issue that helped address that I think though - worked on by Hoss and Mike McCandless - not sure if that helps here or if this was overlooked or what though - I'll have to go skim that issue again. *edit*

If you just pass 0, many times it will be wrong. Why shouldn't this have access to a doc id cache as well? We always ask for everything in the context of the Reader given. I think thats the issue. Lucene just never officially supported this use case - we can't with MultiSearcher, Searchable, Remote - the API doesn't work with the idea that you can count on all the doc ids from a Reader. You were taking advantage of the implementation and your limited use of the full API - but its never been part of the API IMHO.

Perhaps we could one day change things - RMI hasn't really worked out in comparison to other methods large scale (supposedly very chatty - though I have been told very large installations have been built with it ) - we have already factored it into contrib. But this still doesn't fit the current model/API, and if we address it, it will take longer than 2.9 to do right IMO.

      was (Author: markrmiller@gmail.com):
    {quote}
howevever, this method is actually also a bit broken now with per segment searching
what if reader is a MultiReader
{quote}

Right - there are many places where this could be the case - your still free to use multi-readers, though we encourage you to switch. We provide a cool cache sanity checker to help you find these cases, and evaluate whether or not you can make the switch.

If you just pass 0, many times it will be wrong. Why shouldn't this have access to a doc id cache as well? We always ask for everything in the context of the Reader given. I think thats the issue. Lucene just never officially supported this use case - we can't with MultiSearcher, Searchable, Remote - the API doesn't work with the idea that you can count on all the doc ids from a Reader. You were taking advantage of the implementation and your limited use of the full API - but its never been part of the API IMHO.

Perhaps we could one day change things - RMI hasn't really worked out in comparison to other methods large scale (supposedly very chatty - though I have been told very large installations have been built with it ) - we have already factored it into contrib. But this still doesn't fit the current model/API, and if we address it, it will take longer than 2.9 to do right IMO.
  
> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746838#action_12746838 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq.  faster commit/less memory vs faster sorting

Have you done any benching here? I think we actually found that even most sorting cases were faster than in 2.4.1.

Its a lot of poison to swallow top level - loading a field cache off a multi-segment index was dog slow.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745960#action_12745960 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. You never officially had the full index context
Officially, i didn't "not" have the full index context either (it was undefined at best, but was clear from both lucene code and my use of the API that i did have the full index context)

Whenever i do a search, i always explicitly know what context i'm searching in (its always an IndexSearcher context)
further, whenever i pass an IndexReader to any method (to create a cache/etc), i explicitly know what context i'm dealing with in order to know what the docids used mean
as the application developer, i have full control over what i pass into the lucene API and where, and know the context of passing that in (javadoc should just be fully clear on how what goes in is used (if not already) (i always have the option to not use a utility class/method provided by lucene if it does not have the proper context semantics i need (and can write my own that does)

bq. The current API would not support this without back compat breaks up the wazoo
i kinda see what you mean here, but then how is it ok to pass an IndexReader to this method by the same right
it seems like it should be ok to pass the IndexSearcher (the direct context for the IndexReader) for the IndexReader in question to Weight.scorer() if its ok to pass the IndexReader (the scorer() method's interface was already changed between 2.4 and 2.9 (adding allowDocsInOrder and topScorer))

bq. You can pick, but we have to be true to the API or change it (not easy with our back compat policies)
be fair, 2.9 has a lot of back compat breaks, both in API and runtime behavior (i had tons of compile errors when i dropped 2.9 in, as well as some other hacks i had to add in (at least temporarily) in order to get 2.9 to work due to run time changes (primarily this per segment search stuff))

I have no problem with back compat breaks in general (only took me about a day to absorb 2.9 initially (still working on fully taking advantage of new features and getting rid of deprecated class use)) The only requirement i would put on a back compat break is that it have a workaround to get back the the previous versions behavior (in this case have it possible to remap the docids to the "IndexSearcher" context inside the scorer)



> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746634#action_12746634 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

bq. Using a per-segment cache will cause some significant performance loss when performing faceting, as it requires creating the facets for each segment, and then merging them (this results in a good deal of extra object overhead/memory overhead/more work where faceting on the multi-reader does not see this)

This is a good point... Yonik, how [in general!] is Solr handling the cutover to per-segment, for faceting?

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745045#action_12745045 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

"rely on" it is

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746004#action_12746004 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

Looks like Filter should have another method added getDocIdSet(IndexSearcher searcher, IndexReader reader) (deprecating getDocIdSet(IndexReader))

new method would call old method by default (with little harm done in general)
IndexSearcher would call the new getDocIdSet() variant
QueryWrapperFilter would be updated to implement getDocIdSet(IndexSearcher, IndexReader) (with old method wrapping IndexReader with an IndexSearcher)
This would actually be cleaner for QueryWrapperFilter, as it wouldn't have to create a new IndexSearcher on every call

i definitely see that this is potentially more painful than the changes to the scorer() method (question is how many people implement custom Filters?)

Personally, i don't use Filter, so any changes here don't impact me, but to the best of my knowledge, i'm not the only one using lucene :)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744716#action_12744716 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

For the explain case, if Searcher had the following method, the base offset would not have to be passed in (could be inferred):
{code}
public int getIndexReaderBase(IndexReader reader);
{code}

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744742#action_12744742 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

{quote}breaking this out into a per segment cache would then require me to also store along with the int[] index array a String[] 
this then gets really slow and nasty when trying to run through the required algorithms to use the cache for my application {quote}

Why is this? You must be keying this to the multi reader now right? Why cannot you not do the same thing keyed to the sub reader?

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744746#action_12744746 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

Thought about it a bit more

all i really need is the following method added to Searcher
{code}
 public int getIndexReaderBase(IndexReader);
{code}

if this doesn't go into 2.9, i should be able to add this method to my subclass of IndexSearcher (so i wouldn't put this as blocker status for me anymore)

as long as Searcher has getIndexReaderBase(), i can have my Weight hold onto the Searcher it was created with and when the scorer is created off it, i can lookup the base offset based on the IndexReader passed in

i can work up a patch for this in the morning if desired

also, there should be a big old warning in the change log about MultiReader based cache used in a Weight/Scorer being broken (not sure if there's a warning specifically about this in there)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849358#action_12849358 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

This would actually be solved by LUCENE-2345 for me as i would then be able to tag SegmentReaders with any additional accounting information i would need

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746623#action_12746623 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

Tim, one option might be to subclass DirectoryReader (though, it's package protected now, and, you'd need to make your own "open" to return your subclass), and override getSequentialSubReaders to return null?  Then Lucene would treat it as an atomic reader.  Could that work?

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745963#action_12745963 ] 

Mark Miller edited comment on LUCENE-1821 at 8/21/09 6:27 AM:
--------------------------------------------------------------

bq. (it was undefined at best, but was clear from both lucene code and my use of the API that i did have the full index context)

It was an implementation detail. If you look at MultiSearcher, Searchable, Searcher and how the API is put together, you can see we don't support that type of thing. I think its fairly clear after a little thought.

You can limit your API's to handle just IndexSearchers, but as a project, we cannot.

bq. it seems like it should be ok to pass the IndexSearcher (the direct context for the IndexReader) 

MultiSearcher and Searchable make this impossible IMO. We would be playing to those that don't fully use the API, and thats a mistake in my opinion. At best, we would have to shift the whole API.

Its okay to pass the Reader because its a contextless Reader. There is no value in also passing a contextless Searcher IMO - especially when its an arbitrary different context. We have to live up to the current API - your throwing MultiSearcher, Searchable, Remote out the window.

bq. be fair, 2.9 has a lot of back compat breaks,

Oh I'm fair, I know that for sure - though I do like to argue way to much for my own good. All of these back compat breaks were painful to stomach ;) But we reached each one under special circumstances - usually our own early incompetence :) We technically are not allowed to just break things though. We break to fix what we already accidentally broke, or we break when we screwed up earlier and we are in between a rock and a hard place now - or we break when something else is broke anyway, so lets do more :) This was the release of the break for sure. We don't necessarily want this to happen every release though, and its our responsibility to strive towards our back compat policy (listed on the wiki).

I'm not talking about a break in adding a Searcher - that would be fine - back compat is already broken there - but unless we can pass a MultiSearcher there over a remove RMI call, its a break of the whole API IMO.

      was (Author: markrmiller@gmail.com):
    bq .(it was undefined at best, but was clear from both lucene code and my use of the API that i did have the full index context)

It was an implementation detail. If you look at MultiSearcher, Searchable, Searcher and how the API is put together, you can see we don't support that type of thing. I think its fairly clear after a little thought.

You can limit your API's to handle just IndexSearchers, but as a project, we cannot.

bq. it seems like it should be ok to pass the IndexSearcher (the direct context for the IndexReader) 

MultiSearcher and Searchable make this impossible IMO. We would be playing to those that don't fully use the API, and thats a mistake in my opinion. At best, we would have to shift the whole API.

Its okay to pass the Reader because its a contextless Reader. There is no value in also passing a contextless Searcher IMO - especially when its an arbitrary different context. We have to live up to the current API - your throwing MultiSearcher, Searchable, Remote out the window.

bq. be fair, 2.9 has a lot of back compat breaks,

Oh I'm fair, I know that for sure - though I do like to argue way to much for my own good. All of these back compat breaks were painful to stomach ;) But we reached each one under special circumstances - usually our own early incompetence :) We technically are not allowed to just break things though. We break to fix what we already accidentally broke, or we break when we screwed up earlier and we are in between a rock and a hard place now - or we break when something else is broke anyway, so lets do more :) This was the release of the break for sure. We don't necessarily want this to happen every release though, and its our responsibility to strive towards our back compat policy (listed on the wiki).

I'm not talking about a break in adding a Searcher - that would be fine - back compat is already broken there - but unless we can pass a MultiSearcher there over a remove RMI call, its a break of the whole API IMO.
  
> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745941#action_12745941 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I'm OK with having to jump through some hoops in order to get back to the "full index" context

It would be nice if this was more facilitated by lucene's API (IMO, this would be best handled by adding a Searcher as the first arg to Weight.scorer(), as then a Weight will not need to hold on to this (breaking serializable))

There are definitely plenty of use cases that take advantage of the "whole" index (one created by IndexWriter), so this ability should not be removed
I have at least 3 in my application alone (and they are all very important)

You get tradeoffs working "Per-Segment" vs "Per-MultiReader" when it comes to caching in general
going per-segment means caches load faster, and load less frequently, however this causes algorithms working with the caches to be slower (depending on algorithm and cache type)

for static boosting from a field value (ValueSource), it makes no difference
for numeric sorting, it makes no difference 

for string sorting, it makes a big difference - you now have to do a bunch of String.equals() calls, where you didn't have to in 2.4 (just used the ord index)
Given this reason, you should really be able to do string sorting 2 ways
* using per segment field cache (commit time/first query faster, sort time slower)
* using multi-reader field cache (commit time/first query slower, sort time faster)

This same argument also goes for features like faceting (not provided by lucene, but is provided by applications like solr, and my application). Using a per-segment cache will cause some significant performance loss when performing faceting, as it requires creating the facets for each segment, and then merging them (this results in a good deal of extra object overhead/memory overhead/more work where faceting on the multi-reader does not see this)

In the end, it should be up to the application developer to choose what strategy works best for them, and their application (fast commits/fast cache loading may take a back seat to fast query execution)

In general, i find there is a tradeoff between commit time and query time. The more you speed up commit time, the slower query time gets, and vice versa
I just want/need the ability to choose






> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744719#action_12744719 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

Can you give a use example? What are you caching in the Scorer index wide? It really seems you should be doing it per segment ...

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-1821:
------------------------------

    Description: 
Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)

If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid

suggest having Weight.scorer() method also take a integer for the doc offset

Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
All Weights that have "sub" weights must pass this offset down to created "sub" weights


Details on workaround:
In order to work around this, you must do the following:
* Subclass IndexSearcher
* Add "int getIndexReaderBase(IndexReader)" method to your subclass
* during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
* during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
* Scorer can now rebase any collected docids using this offset

Example implementation of getIndexReaderBase():
{code}
// NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getReader()) {
    return 0;
  } else {
    List readers = new ArrayList();
    gatherSubReaders(readers);
    Iterator iter = readers.iterator();
    int maxDoc = 0;
    while (iter.hasNext()) {
      IndexReader r = (IndexReader)iter.next();
      if (r == reader) {
        return maxDoc;
      } 
      maxDoc += r.maxDoc();
    } 
  }
  return -1; // reader not in searcher
}
{code}

Notes:
* This workaround makes it so you cannot serialize your custom Weight implementation


  was:
Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)

If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid

suggest having Weight.scorer() method also take a integer for the doc offset

Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
All Weights that have "sub" weights must pass this offset down to created "sub" weights






> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746625#action_12746625 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

bq. for string sorting, it makes a big difference - you now have to do a bunch of String.equals() calls, where you didn't have to in 2.4 (just used the ord index)

We actually went through a number of iterations on this, on the first cutover to per-segment collection, and eventually arrived at a decent comparator (StringOrdValComparator) that operates per segment.  Have you tested performance of this comparator?


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744747#action_12744747 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

the int[] index array is used as a "perfect" hash function in order to be able to map all documents sharing the same value, so the int[] index from each sub reader can have different hash values for the same term (Thats why the string sorting in 2.9 is real nasty, because it can't just compare the ord values any more (unless you're lucky and you're comparing docs from the same reader)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744855#action_12744855 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

on "internal lucene ids"

right now (2.4) i always know what docids map to what IndexReader (always the top level)
i don't have a problem breaking that assumption, as long as i have the context to map docids between spaces (sub reader to top level and back (mabye even a couple more levels in between))  

I opened this ticket soley because in the Scorer API there is no presented way to actually do this mapping between spaces (short of the "hacks" i discussed way above)
I have no problem whatsoever in the changing of what "space" docids exist in, as long as i can get to and from the top level to this space
I also have no problem if how to do this mapping changes between releases (as long as its documented)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745963#action_12745963 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq .(it was undefined at best, but was clear from both lucene code and my use of the API that i did have the full index context)

It was an implementation detail. If you look at MultiSearcher, Searchable, Searcher and how the API is put together, you can see we don't support that type of thing. I think its fairly clear after a little thought.

You can limit your API's to handle just IndexSearchers, but as a project, we cannot.

bq. it seems like it should be ok to pass the IndexSearcher (the direct context for the IndexReader) 

MultiSearcher and Searchable make this impossible IMO. We would be playing to those that don't fully use the API, and thats a mistake in my opinion. At best, we would have to shift the whole API.

Its okay to pass the Reader because its a contextless Reader. There is no value in also passing a contextless Searcher IMO - especially when its an arbitrary different context. We have to live up to the current API - your throwing MultiSearcher, Searchable, Remote out the window.

bq. be fair, 2.9 has a lot of back compat breaks,

Oh I'm fair, I know that for sure - though I do like to argue way to much for my own good. All of these back compat breaks were painful to stomach ;) But we reached each one under special circumstances - usually our own early incompetence :) We technically are not allowed to just break things though. We break to fix what we already accidentally broke, or we break when we screwed up earlier and we are in between a rock and a hard place now - or we break when something else is broke anyway, so lets do more :) This was the release of the break for sure. We don't necessarily want this to happen every release though, and its our responsibility to strive towards our back compat policy (listed on the wiki).

I'm not talking about a break in adding a Searcher - that would be fine - back compat is already broken there - but unless we can pass a MultiSearcher there over a remove RMI call, its a break of the whole API IMO.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745979#action_12745979 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

NOTE: if the leaf IndexSearcher were to be passed to scorer(), it would also have to be passed to explain()

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744755#action_12744755 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I'd go with your solution then - I'd hate to expose doc base stuff on Searcher or in Scorer (I'm not a big fan of gatherSubReaders being protected either).

Also, *we* can't put Searcher on a Weight - it needs to be serializable (though you can if you don't need to count on that).

Where do you suggest we should place a warning that we havn't? I don't think we ever intended to support the use of external ids.

I think your use case is fairly special, and I'm not sure we should expose per segment stuff where we don't have to.

It is an interesting use case though - maybe someone else will chime in.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746645#action_12746645 ] 

Yonik Seeley commented on LUCENE-1821:
--------------------------------------

bq. I say we push this issue from 2.9 for now.

+1 


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-1821:
------------------------------

    Attachment: LUCENE-1821.patch

Here's a patch that adds getIndexReaderBase(IndexReader reader) to IndexSearcher

sadly, this cannot be easily added to MultiSearcher as well as it uses "Searchables", which would require adding this method to the Searchable interface
I could work up another patch that adds this method to the Searchable interface, however that has some back-compat concerns


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744854#action_12744854 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. I'm not saying we should make it impossible for you to do this - but I don't think we should open a path for scorers to reconstruct multi-reader virtual ids. I don't think a Scorer should know or care why type of IndexReader it is working with.

i disagree with that, i think the APIs should make it clear whether you are working with a sub reader or a top level reader
if a Scorer is given an IndexReader, it should have the same ability to reconstruct the "client facing" docid in the same manner as the Collector interface provides in order to provide a consistent interface between Collectors and Scorers
This reconstruction should be documented as "advanced", however it should still be available

Whereever an IndexReader is exposed in API calls, it should be possible to walk the IndexReader's parent IndexReaders until you get the top level reader in order to have the full context of that IndexReader. This walking should only be done at "init" time (Scorer construction/Collector setScorer(), and so on depending on need of the application, but it should be possible (ideally without doing nasty things))


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745997#action_12745997 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

Its org.apache.lucene.search.QueryWrapperFilter. And technically we have to account for subclasses and combinations and anything possible.

This being the release of the break, who knows though. I can't reasonably see releasing without notes indicating that you *must* recompile. And while we want to limit how much work must be done (we also consider the likely impact), this would be the time to skirt through.

Pretty much depends on what McCandless weighs in with I guess - unless a new spectator pops up.
{code}
getDocIdSet(IndexReader) : DocIdSet - org.apache.lucene.search.QueryWrapperFilter
	bits(IndexReader) : BitSet - org.apache.lucene.search.BooleanFilterTest.getOldBitSetFilter(...).new Filter() {...}
	ConstantScorer(Similarity, IndexReader, Weight) - org.apache.lucene.search.ConstantScoreQuery.ConstantScorer
	explain(Searcher, IndexReader, int) : Explanation - org.apache.lucene.search.FilteredQuery.createWeight(...).new Weight() {...}
	getDISI(ArrayList, int, IndexReader) : DocIdSetIterator - org.apache.lucene.search.BooleanFilter
	getDocIdSet(IndexReader) : DocIdSet - org.apache.lucene.search.BooleanFilter (3 matches)
	getDocIdSet(IndexReader) : DocIdSet - org.apache.lucene.search.CachingWrapperFilter
	getDocIdSet(IndexReader) : DocIdSet - org.apache.lucene.search.CachingWrapperFilterHelper
	getDocIdSet(IndexReader) : DocIdSet - org.apache.lucene.search.RemoteCachingWrapperFilter
	getDocIdSet(IndexReader) : DocIdSet - org.apache.lucene.search.RemoteCachingWrapperFilterHelper
	scorer(IndexReader, boolean, boolean) : Scorer - org.apache.lucene.search.FilteredQuery.createWeight(...).new Weight() {...}
	searchWithFilter(IndexReader, Weight, Filter, Collector) : void - org.apache.lucene.search.IndexSearcher
	tstFilterCard(String, int, Filter) : void - org.apache.lucene.search.BooleanFilterTest
{code}

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1821:
--------------------------------

    Affects Version/s:     (was: 2.9)
                       3.1

Yeah, no problem - tag whatever you'd like - I only went to nothing because it was the easiest default move.

With the current plan (subject to change), the earliest it could be considered again is 3.1, so I'll move there.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 3.1
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745416#action_12745416 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

It would also be nice if the top level Searcher were pased in to Weight.scorer() (like in Weight.explain())
that way custom Weight implementations won't need to hold onto the Searcher at Weight creation time (thats a bigger patch though)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745977#action_12745977 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

{quote}
It was an implementation detail. If you look at MultiSearcher, Searchable, Searcher and how the API is put together, you can see we don't support that type of thing. I think its fairly clear after a little thought.

You can limit your API's to handle just IndexSearchers, but as a project, we cannot.
{quote}
I totally understand your resistance here.  I get that i'm really utilizing advanced lucene concepts at very low levels (and these are subject to some changes that i will have to absorb with new versions)

bq. Its okay to pass the Reader because its a contextless Reader. There is no value in also passing a contextless Searcher
well, when you pass the Searcher that contains Reader, the Reader is no longer contextless.
also, the context of the Searcher can be fairly well defined (its a "leaf" Searcher. the one that actually called Weight.scorer())

Also, looking a bit more at MultiSearcher semantics, sorting requires this "leaf" Searcher context in order to work already
MultiSearcher just takes the top docs from each underlaying Searchable, adjusts the docids to the MultiSearcher Context, and sends them through another priority queue
So, this "leaf" Searcher context concept is required by sorting already. 
I just want my Scorer to be given this "leaf" context as well

Also, since it is a "leaf" context, the Weight.scorer() method could have the following interface:
{code}
/**
 * @param searcher The IndexSearcher that contains reader.
 */
public Scorer scorer(IndexSearcher searcher, IndexReader reader, boolean allowDocsInOrder, boolean topScorer);
{code}

then, with the patch i posted, i could call:
searcher.getIndexReaderBase(reader) 
and i'm all set



> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745986#action_12745986 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq. clearly explained with javadoc

We need something along the lines of - you cannot count on this IndexSearcher to cover the whole index unless you control the use of your application to ensure that - its possible that this IndexSearcher will only correspond to one sub-index of multiple being used in the context of one large index.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757199#action_12757199 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I've been playing with per-segment caches for the last couple of weeks and have got everything working pretty well

However, i have to end up doing a lot of mapping between an IndexReader instance, and the "index into the IndexReader[]" array of the IndexSearcher
this then allows me to easily get the proper document offset where needed, and/or get a handle on the proper per-segment cache/evaluation object/etc

For my use cases, it would be much easier if the following methods were available:

on Weight:
{code}
// readerId is the "i" in the for (int i = 0; i < readers.length; ++i) in IndexSearcher
// NOTE: that readerId is at the IndexSearcher level, not the MultiSearcher level
public Scorer scorer(IndexReader reader, int readerId, boolean inOrder, boolean topLevel);
{code}

on Collector:
{code}
public void setNextReader(IndexReader reader, int docBase, int readerId);
// NOTE: this isn't extremely needed, as its easier to get the readerId from docBase (using a cached int[] of docbases for the searcher)
{code}

I suppose i could use the fact that these methods will always be called in order, keeping and incrementing counter, however the javadoc explicitly says that these methods may be called out of "segment order" to be more efficient in the future. It would therefore be very useful if these indexes were passed into these methods.

To work around this, my searcher currently has a getReaderIdForReader() method very similar to my earlier proposed getIndexReaderBase() method




> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1821:
--------------------------------

    Fix Version/s:     (was: 2.9)

I'm going to push it out for now. Of course, feel free to argue for its re inclusion.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745985#action_12745985 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I certainly think IndexSearcher makes a lot more sense than Searchable or Searcher there - it somewhat handles the whole API break thing - your clearly limiting to an IndexSearcher, so its compatible with the current API and can be clearly explained with javadoc - I still worry about pushing the API towards things that leave MultiSearcher/Remote out in the cold on features. They are technically first class citizens. I think some expert warnings could sell me anyway though.

I still have a problem with this though:

{code}
public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    final Weight weight = query.weight(new IndexSearcher(reader));
    return new DocIdSet() {
      public DocIdSetIterator iterator() throws IOException {
        return weight.scorer(reader, docBase?, true, false);
      }
    };
  }

{code}

That scorer call should get the right IndexSearcher and I don't see how it can without breaking back compat on this method and passing an IndexSearcher too.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-1821:
------------------------------

    Fix Version/s: 2.9

Marking as fix for 2.9 so this gets looked over real good prior to 2.9 going out (even if it is punted)

i believe i can workaround this (still integrating 2.9 into my app, and i haven't got to per-field caching stuff yet)

In order to get my app to work with 2.9 (without any major mods), i had to add the following to my IndexSearcher subclass:
NOTE: i never use Filter (thats why this skips it over)
{code}
  @Override
  public void search(Weight weight, Filter filter, Collector collector) throws IOException {
    // Need to work on the top level reader for now
    collector.setNextReader(reader, 0);
    final Scorer scorer = weight.scorer(reader, !collector.acceptsDocsOutOfOrder(), true);
    if (scorer != null) {
      scorer.score(collector);
    }
  }
{code}

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746809#action_12746809 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. Actually sorting (during collection) already gives you the docBase so shouldn't your app already have the context needed for this?

Yes, i get the docbase and all during collection, so doing sorting with a top level cache will be no problem.
I was mainly using sorting as an example of some of the pain caused by per-segment searching/caches (the Collector API makes it easy enough to do sorting
on the top level or per segment, so i'm not concerned about integration here)

For my app, i plan to allow sorting to be either "per-segment" or "top-level" in order to allow people to choose thier poison: faster commit/less memory vs faster sorting
I also plan to do faceting likewise
certain features will always require a top-level cache (but those are advanced features anyway and should be expected to have impacts on commit time/first search time)

bq. Hmm... is advance in fact costly for your DocIdSets?

Think how costly it would be to do advance for the SortedVInt DocIdSet (linear search over compressed values)
for a bitset, this is instantaneous, but to conserve memory, its better to use a sorted int[] (or the SortedVInt stuff 2.9 provides)

in the end, i plan to bucketize the collected docs per segment, so in the end this should hopefully be less of an issue
nice thing about that approach is that i can have a bitset for one segment (lost of matches in this segment) and a very small int[] for a different segment based on the matches per segment. Biggest difficulty is doing the mapping to the per-segment "DocIdSet" (which will probably have to be slower)

bq. this one method would allow you to not have to subclass IndexSearcher.

I already have to subclass index searcher (i do a lot of extra stuff)
however, the IndexSearcher doesn't provide any protected access to its sub readers and doc starts, so i have to do this myself in my subclass's constructor (in the same way IndexSearcher is doing this

I would really like to see getIndexReaderBase() added to 2.9's IndexSearcher
I would also like to see the subreaders and docstarts either made protected or given protected accessor methods (so i don't have to recreate the same set of sub readers (and make sure i do this the same way for future versions of lucene)
Would also be nice to see a protected constructor on IndexSearcher like so:
{code}
  protected IndexSearcher(IndexReader reader, IndexReader[] subReaders, int[] docStarts) {
   ...
  }
{code}

This would allow creating "temporary" IndexSearchers much faster (don't need to gather sub readers)
This would allow:
* easily creating IndexSearcher that is "top-level" (subReaders[] would be length 1 and just contain reader)
* create a "temporary" IndexSearcher off another IndexSearcher that contains some "short lived" context (i have this use case)




> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744732#action_12744732 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I think i found a workaround (to allow me to upgrade)

I already subclass IndexSearcher, so i could just add the following method (and use this instead of regular search())
{code}
  public void mySearch(Weight weight, Collector collector) throws IOException {
    collector.setNextReader(reader, 0);
    Scorer scorer = weight.scorer(reader, !collector.acceptsDocsOutOfOrder(), true);
    if (scorer != null) {
      scorer.score(collector);
    }
  }
{code}

however, this prevents me from being able to take full advantage of per segment caching (which i really want to use for all my other caches)


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746644#action_12746644 ] 

Yonik Seeley commented on LUCENE-1821:
--------------------------------------

bq. This is a good point... Yonik, how [in general!] is Solr handling the cutover to per-segment, for faceting?

It doesn't.  Faceting is not connected to searching in Solr, and is only done at the top level IndexReader.
We obviously want to enable per-segment faceting for more NRT in the future - with the expected disadvantage that it will be somewhat slower for some types of facets.  I imagine we will keep the top-level faceting as an option because there will be tradeoffs.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746643#action_12746643 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

Lot of new comments to respond to :)
will try to cover them all

bq. decent comparator (StringOrdValComparator) that operates per segment.

Still, the StringOrdValComparator will have to break down and call String.equals() whenever it compars docs in different IndexReaders
It also has to do more maintenance in general than would be needed for just a StringOrd comparator that would have a cache across all IndexReaders
While the StringOrdValComparator may be faster in 2.9 than string sorting in 2.4, its not as fast as it could be if the cache was created on the IndexSearcher level
I looked at the new string sorting stuff last week, and it looks pretty smart to reduce the number of String.equals() calls needed, but this adds extra complexity and will still be reduced to String.equals() calls, which will translate to slower sorting than could be possible

bq. one option might be to subclass DirectoryReader 

The idea of this is to disable per segment searching?
I don't actually want to do that. I want to use per segment searching functionality to take advantage of caches on per segment basis where possible, and map docs to the IndexSearcher context when i can't do per segment caching.

bq. Could you compute the top-level ords, but then break it up per-segment?

I think i see what your getting at here, and i've already thought of this as a potential solution. The cache will always need to be created at the top most level, but it will be pre-broken out into a per-segment cache whose context is the top level IndexSearcher/MultiReader. The biggest problem here is the complexity of actually creating such a cache, which i'm sure will translate to this cache loading slower (hard to say how much slower without implementing)
I do plan to try this approach, but i expect this will be at least a week or two out from now.

I've currently updated my code for this to work per-segment by adding the docBase when performing the lookup into this cache (which is per-IndexSearcher)
I did this using my getIndexReaderBase() funciton i added to my subclass of IndexSearcher during Scorer construction time (I can live with this, however i would like to see getIndexReaderBase() added to IndexSearcher, and the IndexSearcher passed to Weight.scorer() so i don't need to hold onto my IndexSearcher subclass in my Weight implementation)

bq. just return the "virtual" per-segment DocIdSet.

Thats what i'm doing now. I use the docid base for the IndexReader, along with its maxDoc to have the Scorer represent a virtual slice for just the segment in question
The only real problem here is that during Scorer initialization for this i have to call fullDocIdSetIter.advance(docBase) in the Scorer constructor. If advance(int) for the DocIdSet in question is O(N), this adds an extra penalty per segment that did not exist before

bq. his isn't a long-term solution, since the order in which Lucene visits the readers isn't in general guaranteed,

that's where IndexSearcher.getIndexReaderBase(IndexReader) comes into play. If you call this in your scorer to get the docBase, it doesn't matter what order the segments are searched in (as it'll always return the proper base (in the context of the IndexSearcher that is))


Here's another potential thought (very rough, haven't consulted code to see how feasible this is):
what if Similarity had a method called getDocIdBase(IndexReader)
then, the searcher implementation could wrap the provided Similarity to provide the proper calculation
Similarity is always already passed through this chain of Weight creation and is passed into the Scorer
Obviously, a Query Implementation can completely drop the passing of the Searcher's similarity and drop in its own (but this would mean it doesn't care about getting these docid bases)
I think this approach would potentially resolve all MultiSearcher difficulties







> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1821:
--------------------------------

    Affects Version/s:     (was: 3.1)
                       2.9
        Fix Version/s: 3.1

whoops - try the right thing this time

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745756#action_12745756 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

{quote}
howevever, this method is actually also a bit broken now with per segment searching
what if reader is a MultiReader
{quote}

Right - there are many places where this could be the case - your still free to use multi-readers, though we encourage you to switch. We provide a cool cache sanity checker to help you find these cases, and evaluate whether or not you can make the switch.

If you just pass 0, many times it will be wrong. Why shouldn't this have access to a doc id cache as well? We always ask for everything in the context of the Reader given. I think thats the issue. Lucene just never officially supported this use case - we can't with MultiSearcher, Searchable, Remote - the API doesn't work with the idea that you can count on all the doc ids from a Reader. You were taking advantage of the implementation and your limited use of the full API - but its never been part of the API IMHO.

Perhaps we could one day change things - RMI hasn't really worked out in comparison to other methods large scale (supposedly very chatty - though I have been told very large installations have been built with it ) - we have already factored it into contrib. But this still doesn't fit the current model/API, and if we address it, it will take longer than 2.9 to do right IMO.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745039#action_12745039 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I think thats a good idea. I think that last sentence needs a bit of work. Here is another attempt that I am still not quite happy with:

{code}13. LUCENE-1483: When searching over multiple segments, a new Scorer is created for each segment. 
        The Weight is created only once for the top level searcher. Each Scorer is passed the per-segment IndexReader.
        This will result in docids in the Scorer being internal to the per-segment IndexReader and there is currently no way
         to rebase these docids to the top level IndexReader. This will likely break any caches/filters in Scorers that rely on docids from the top 
         level IndexReader eg if you rely on the IndexReader to contain every doc id in the index.{code}

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745044#action_12745044 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

Looks great!

I still almost want to say rely on though:

bq. uses any caches/filters based on the top level IndexReader/Searcher

bq. uses any caches/filters that rely on being based on the top level IndexReader/Searcher

No? It seems like you could be based on a top level reader before, but not rely on the fact that it was a top level ...

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744824#action_12744824 ] 

Hoss Man commented on LUCENE-1821:
----------------------------------

bq. you are caching from external id to ord - its really not something I think we intend to support. The fact that we don't support it is why we were able to make this change. The FieldCache is the caching mechanism that Lucene supports with internal ids - and it supports it per segment.

I think Tim's got a valid point though about wanting an ordinal value across the entire index ... he's not using external ids, he's using the internal lucene docIds, and wants to know the ordinal value of a field for each doc across the entire index -- as he said, he's essentially using a FieldCache.StringIndex he just doesn't care about the String[] part.

Solr had/has the same problem with some of the function queries that wanted ordinal values (or the min/max field value for the whole index) that i think yonik just punted on and fetched the outermost field cache anyway ... we just weren't using it inside the Weight class, so we didn't encounter the specified problem Tim did.


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745948#action_12745948 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq. I'm OK with having to jump through some hoops in order to get back to the "full index" context

You never officially had the full index context - only because you jettison a large part of the API did you have it.

bq.  this would be best handled by adding a Searcher as the first arg to Weight.scorer()

The current API would not support this without back compat breaks up the wazoo - the MultiSearcher can be on the client - its not available on the server. Passing just the local Searcher does not jive with the API.


{quote}for string sorting, it makes a big difference - you now have to do a bunch of String.equals() calls, where you didn't have to in 2.4 (just used the ord index)
Given this reason, you should really be able to do string sorting 2 ways{quote}

This is only valid for those short circuiting the API and ignoring MultiSearcher and its affects on the API. As a project, we can't and shouldn't support this type of thing unless we can make it work with MultiSearcher or eventually pull MultiSearcher.

bq. In the end, it should be up to the application developer to choose what strategy works best for them, and their application (fast commits/fast cache loading may take a back seat to fast query execution)

You can pick, but we have to be true to the API or change it (not easy with our back compat policies)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745754#action_12745754 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

in the case of the getDocIdSet() method, i would say you should pass "0" for the docBase
This is because in this case, you are asking for DocIdSet in the context of *reader*

howevever, this method is actually also a bit broken now with per segment searching
what if *reader* is a MultiReader
This could now incur the "double ram usage" penalty refered to in that "explain()" ticket i recall seeing last week
If  *query* has any ValueSource based queries, it'll result in getting the ValueSource in the context of the MultiReader (if i'm not mistaken)
So, this method in particular should probably be rewritten to return a DocIdSetIterator that will step through each segment in "reader" in turn

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745036#action_12745036 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

Concerning the changelog, i feel the below should be added to the "Changes in runtime behavior" section (it's kinda specified in "New features", however
it is also a rather substantial change in the runtime behavior and should be called out explicitly there)

{code}
 13. LUCENE-1483: When searching over multiple segments, a new Scorer is created for each segment. 
        The Weight is created only once for the top level searcher. Each Scorer is passed the per-segment IndexReader.
        This will result in docids in the Scorer being internal to the per-segment IndexReader and there is currently no way
         to rebase these docids to the top level IndexReader. This results in any caches/filters that use docids over the top 
         IndexReader to be broken.
{code}


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745423#action_12745423 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I can work up another patch where the Searcher is passed into Weight.scorer() as well if that is an acceptable approach (this method was already changed alot in 2.9 anyway)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744718#action_12744718 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I have a really crazy cache that for performance and memory reasons has to be based on the Multi-Reader (and this is very important in my application)

looking closer at the per segment searching, i cannot upgrade to 2.9 (which i really want to do) because of this

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746630#action_12746630 ] 

Yonik Seeley commented on LUCENE-1821:
--------------------------------------

bq. Filter.getDocIdSet(IndexSearcher, IndexReader).

This suggests that one needs an IndexSearcher to get the ids matching a filter.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746639#action_12746639 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

Cool - I don't like it much either.

I say we push this issue from 2.9 for now.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746513#action_12746513 ] 

Mark Miller edited comment on LUCENE-1821 at 8/22/09 2:46 PM:
--------------------------------------------------------------

{quote}Looks like Filter should have another method added getDocIdSet(IndexSearcher searcher, IndexReader reader) (deprecating getDocIdSet(IndexReader))
new method would call old method by default (with little harm done in general) {quote}

Its the corner cases. Someone's class calls the deprecated method directly - someone is using that, plus a new class that overrides the none deprecated method - which never gets called, cause the other code is calling the dep method directly. Technically, everything has to be covered (Depending on how consensus goes anyway ... always depending ...). Its a pain in the butt just thinking about it. 

*edit*

In your example deprecation this is actually the opposite - someone calls the new code directly, but other code you are using overrides the deprecated code. The override is now not called.

      was (Author: markrmiller@gmail.com):
    {quote}Looks like Filter should have another method added getDocIdSet(IndexSearcher searcher, IndexReader reader) (deprecating getDocIdSet(IndexReader))
new method would call old method by default (with little harm done in general) {quote}

Its the corner cases. Someone's class calls the deprecated method directly - someone is using that, plus a new class that overrides the none deprecated method - which never gets called, cause the other code is calling the dep method directly. Technically, everything has to be covered (Depending on how consensus goes anyway ... always depending ...). Its a pain in the butt just thinking about it. 
  
> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745989#action_12745989 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I don't think thats fully back compat - though it covers a lot of ground.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745991#action_12745991 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

what class is this getDocIdSet method on (lacking the context of where its used)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745951#action_12745951 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

The Searcher being passed to explain is also really a break - I almost think we should pull it.

We put it in because a break was already introduced when someone tried to add stats for the whole context in TermWeight - that wasn't legal though.

To prevent further spread, I actually think we need to pull that searcher and that extra explain info in TermWeight.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746662#action_12746662 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

can i at least argue for it being tagged for 3.0 or 3.1 (just so it gets looked at again prior to the next releases)

I have workarounds for 2.9, so i'm ok with it not getting in then (just want to make sure my use cases won't be made impossible in future releases)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746513#action_12746513 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

{quote}Looks like Filter should have another method added getDocIdSet(IndexSearcher searcher, IndexReader reader) (deprecating getDocIdSet(IndexReader))
new method would call old method by default (with little harm done in general) {quote}

Its the corner cases. Someone's class calls the deprecated method directly - someone is using that, plus a new class that overrides the none deprecated method - which never gets called, cause the other code is calling the dep method directly. Technically, everything has to be covered (Depending on how consensus goes anyway ... always depending ...). Its a pain in the butt just thinking about it. 

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745988#action_12745988 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

here's what you can do:

{code}
  /** @deprecated use {@link getDocIdSet(IndexSearcher, IndexReader)} */
  public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    return getDocIdSet(new IndexSearcher(reader), reader);
  }

  public DocIdSet getDocIdSet(final IndexSearcher searcher, final IndexReader reader) {
    final Weight weight = query.weight(searcher);
    return new DocIdSet() {
      public DocIdSetIterator iterator() throws IOException {
        return weight.scorer(searcher, reader, true, false);
      }
    };
  }
{code}

and yeah, i'm all for tons warnings in javadoc explicitly defining the contracts


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745041#action_12745041 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

One more pass

{code}
13. LUCENE-1483: When searching over multiple segments, a new Scorer is created for each segment. 
        The Weight is created only once for the top level searcher. Each Scorer is passed the per-segment IndexReader.
        This will result in docids in the Scorer being internal to the per-segment IndexReader. If a custom Scorer implementation 
         uses any caches/filters based on the top level IndexReader/Searcher, it will need to be updated to use caches/filters on a 
         per segment basis. There is currently no way provided to rebase the docids in the Scorer to the top level IndexReader.
         See LUCENE-1821 for discussion on workarounds for this.
{code}

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744711#action_12744711 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

since Weight.explain() is passed the "sub reader", it also should be passed the offset in order to calculate the "real" docid

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746633#action_12746633 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

bq. one used an int[] ord index (the underlaying cache cannot be made per segment)

Could you compute the top-level ords, but then break it up
per-segment?  Ie, create your own map of IndexReader -> offset into
that large ord array?  This would make it "virtually" per-segment, but
allow you to continue computing at the top level.

BTW another option is to simply accumulate your own docBase, by adding
up the maxDoc() every time an IndexReader is passed to your
Weight.scorer().  EG this is what contrib/spatial is now doing.

This isn't a long-term solution, since the order in which Lucene
visits the readers isn't in general guaranteed, but it will work for
2.9 and buy time to figure out how to switch scoring to per-segment.


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745430#action_12745430 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I'm still not sold on this - these use cases don't work with MultiSearcher either right? I think thats part of this not being officially supported before either ...

And passing the Searcher doesn't quite work right either - thats already kind of a bug with explain - you get the searcher with the doc - not a searcher that covers the whole multisearcher space.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744857#action_12744857 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

You may get furthers with others than me Tim, so don't get too, too caught up with me. McCandless is still on vacation for one, and he may have ideas other than mine (and certainly better ideas even if they are not other) - others may still jump in too. The two of us did the majority of the per segment work though. Yonik has also fought through a lot of this type of stuff with Solr - I'm sure he has some stance on this stuff.

bq. Whereever an IndexReader is exposed in API calls, it should be possible to walk the IndexReader's parent IndexReaders until you get the top level reader in order to have the full context of that IndexReader

Solr now works this way to get around some of the per segment issues. I'm not sure if it makes sense to support that fully in Lucene or not. Perhaps so. Too late for me to properly think though - time for bed.

Any downsides I wonder ...

I think my main issue is that it will encourage people to work per top level reader and it will force us to take that into account ...



> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746636#action_12746636 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

Net/net, I'm still nervous about pushing down "full context" plus
"context free" searcher/reader deep into Lucene's general searching
(scorer/filter) APIs.  I think these APIs should remain fully
context-free (even IndexSearcher still makes me nervous).

In some sense, Multi/RemoteSearcher keep us honest, in that they force
us to clearly separate out "stuff that has the luxury of full context"
(to be done on construction of Weight) from "the heavy lifting that
must be context free since it may not have access to the top searcher"
(scorer(), getDocIdSet()).


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745911#action_12745911 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

OK... pondering this some more, and on seeing just how much change
would be required, I'm now nervous about making deep changes to
Lucene's scoring/filtering APIS (Weight.scorer, Filter.getDocIdSet) to
enable access to top readers and/or a sub-readers doc base.

All of Lucene's core & contrib now operates "context free"
(per-segment), where each reader need not know its "context" in the
full searcher tree, and I think we should strongly encourage external
usage of these APIs to switch to context free as well.  Since there
are workarounds possible (accessing sub-readers via IndexSearcher),
external apps that have problems making the switch can use these
workarounds?


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744759#action_12744759 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I'll prepare a patch in the morning (unless someone beats me to it) and look over the changelog then to suggest some more disclaimers (if what's there isn't sufficient)

I don't hold the contract that Weight be serializable (so i'm safe there)

i agree that per-segment is the way to go in general and should be as tight as possible (as long as i can get my mits on the "sub readers")

but there are use cases that still require looking at the index as a whole as well
especially if you need to know the number of unique terms for a field, or otherwise need documents in one segment to be aware of documents in other segments (i could probably come up a bunch more use cases there)


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744812#action_12744812 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

Sorting is internal. To allow this switch to per segment we implemented a new HitCollector that can collect from multiple readers - sorting across multiple segments still needed to be supported, and custom comparators still needed to be supported. All of the ids are manged internally though - when I say internally, I mean within Lucene. If you implement a custom FieldComparator, you are still respecting Lucene's internal id usage. We map priority queue values so that they can be compared with the values in a different Reader, but again, all of the ids are managed internally. All caching and everything is still done per segment. All FieldCaches are per Reader and per segment.

The goal is to move all caches to the segment level in Lucene - we don't want to encourage users to cache per multi-reader by providing API help to do so.

If you need index wide stats, you use the Weight.

You are trying to use the internal ids externally - you are caching from external id to ord - its really not something I think we intend to support. The fact that we don't support it is why we were able to make this change. The FieldCache is the caching mechanism that Lucene supports with internal ids - and it supports it per segment.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746617#action_12746617 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

bq. You want to weigh in again Mike ? 

I do!  I'm trying desperately to catch up over here :)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745367#action_12745367 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

I think we should in fact add this API to 2.9?  It can ease the transition for users doing expert stuff w/ Lucene today.  The current patch looks reasonable, but we should mark it as expert and note that one should switch one's app logic to be per-segment whenever possible, instead of operating in the composite reader's docID space.  We should also note that it's O(N) CPU cost, ie, you should not call it for every hit (for example), but rather once per-segment and then hold onto that base.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744827#action_12744827 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq. I think Tim's got a valid point though about wanting an ordinal value across the entire index ...

I don't disagree about wanting them at all. Hes using them for a neat purpose. 

bq. he's not using external ids, he's using the internal lucene docIds

If he were respecting the internal ids, you wouldn't need to calculate the multi-reader id. Hes essentially caching the multi-reader ids - thats the same as using a filter that always allows doc 0 to pass - its using the internal ids externally. To use the ids correctly, you get a reader and an id space that starts at 0 for that reader. If you want to use the whole reader, you should work with the multi-reader. You can use the multi-reader without breaking it apart here as well if you need to.

I think its a slippery slope - we start having to support both the segment ids, plus the multi-reader ids. And as we work on real-time, we will have to count on users caching that way - I think its better to try and work all of our support towards per segment.

I'll leave it for smarter people to discuss for now - but I don't think its the right path. He can essentially do what he needs without built in support, and personally I think thats the way to go. I think its great that right now, other than the sorting/hitcollector, things don't know about the sub reader breakout.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744970#action_12744970 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------


BTW contrib/spatial has exactly this same problem.  It currently
builds up a cache, keyed on the "top" (MultiReader's) docID, of the
precise distance computed by its precise distance filters, to then be
used during sorting.  Right now it simply computes its own docBase and
increments it every time getDocIdSet() is called (which is messy).
Though I think it could (and should) switch to a per-segment cache.

I am torn.  On the one hand we don't want to encourage apps to be
using "top docIDs" anywhere "down low" (eg Weight/Scorer).  We'd like
all such per-segment swtiching to happen "up high".

But on the other hand, this is quite a sudden change, and most
advanced apps will be using the top docIDs by definition (since
per-segment docIDs only becomes an [easy] option in 2.9), so it'd be
more friendly to offer up a cleaner migration path for such apps where
Weight/Scorer is told its docBase.

And, having to migrate an ord index from "top" to "sub" docIDs is
truly a nightmare, having gone through that with Mark in getting
String sorting to work per segment!


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746839#action_12746839 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. Have you done any benching here? I think we actually found that even most sorting cases were faster than in 2.4.1.

I haven't done any benchmarking. 
I'm not arguing that 2.9 string sorting is slower than 2.4 string sorting, it may well be faster for every case.
per segment searching and other improvements potentially added more gains in performance than the new string sorting added losses in performance.

But, i can say rather confidently, that a large index with a bunch of segments will result in string sorting being slower when using a per segment string sort cache instead of a full index sort cache (think worst case using \*:\* query)

bq. loading a field cache off a multi-segment index was dog slow
this is a trade off.
slower cache loading in order to get faster sorting
i plan to provide the ability to do both, and allow specific use cases to decide what is best for them

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745436#action_12745436 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

true, MultiSearcher does kink things up some (and the Searcher abstract class in general)

personally, this is not a problem for me (don't use MultiSearcher (not yet at least)), and i'm happy with being passed the IndexSearcher instance that directly contains the IndexReader i'm being passed

The contract could be marked that the Searcher provided is the direct container of the IndexReader also passed
at which point, both explain() and scorer() would be "accurate" in terms of this

I would almost like to see something different passed in instead of a Searcher/IndexReader pair

i would actually like to see a "SearchContext" sort of object passed in
this would represent the whole "tree" of Searchers/IndexReaders
this would allow access to the MultiSearcher, the direct IndexSearcher, and the sub IndexReader (which should actually be used for the scoring) (as well as any other Searcher's in the call stack) 
this SearchContext could also pass in the "topScorer/allowDocsInOrder" flags (but that would be more difficult as scorers have subscorers that need to sometimes be created with different flags for these), but this SearchContext could be used to pass more information throughout the Scorer API in general from the top level (like - always use constant score queries where possible, use scoring algorithm X, Y, or Z, and so on)

obviously this would impact the API of Searcher a good deal as it would have to maintain this stack as sub Searcher's search() methods are called)

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744853#action_12744853 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq. he's not using external ids, he's using the internal lucene docIds

Let me try a response to this one once more:

If you try and make a filter that always matches docs 0-10, you could have made a filter that just sets bits 0-10. You are technically using 'internal' lucene doc ids. 

With the new per segment search though, you will find that you match the first 10 docs in every segment, not just the first 10 docs in the multi-reader virtual id space. This is what I call using the internal doc ids externally. You are counting on a single id space covering the whole index for the reader. This was never promised though. So just like this type of filter was not *really* supported and no longer works - this method of relying on the IndexReader to support one id space across the whole index no longer works as well. The Searcher supports the whole index, but a given IndexReader was never promised to do so. We could have passed base doc ids to the filters so that they could reconstruct the multi-reader virtual ids, and then just actually match docs 0-10 - but thats exactly the opposite of what we are trying to achieve. We switched to per segment to get away from that.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745418#action_12745418 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq. It can ease the transition for users doing expert stuff w/ Lucene today

I think it only helps if you counted on the reader having every doc id in it?

It needs more than the current patch I think - we can't rely on people being able to plant a Searcher on the Weight ...

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746838#action_12746838 ] 

Mark Miller edited comment on LUCENE-1821 at 8/24/09 5:18 AM:
--------------------------------------------------------------

bq.  faster commit/less memory vs faster sorting

Have you done any benching here? I think we actually found that even most sorting cases were faster than in 2.4.1.

Its a lot of poison to swallow top level - loading a field cache off a multi-segment index was dog slow. *edit* (I can never remember if Yonik fix that or not - he fixed something related, or at least made it better)

      was (Author: markrmiller@gmail.com):
    bq.  faster commit/less memory vs faster sorting

Have you done any benching here? I think we actually found that even most sorting cases were faster than in 2.4.1.

Its a lot of poison to swallow top level - loading a field cache off a multi-segment index was dog slow.
  
> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744715#action_12744715 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I think we would prefer that you didn't cache by multi-reader. We want to encourage cache by segment.

In my mind, I don't think we should try and squeeze this into 2.9. We can see if anyone disagrees.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746797#action_12746797 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

{quote}
bq. decent comparator (StringOrdValComparator) that operates per segment.

Still, the StringOrdValComparator will have to break down and call String.equals() whenever it compars docs in different IndexReaders
{quote}

Agreed, it will be slower than a top-level ords cache, but I'm
wondering in practice, in your case, what impact that turns out to be.
Also, since Lucene has already done this, maybe you could use its
StringOrdValComparator instead of having to cutover yours to
segment-based.

Or, better, work up a patch for a "forced" top-level StringComparator
for apps that don't mind the slow commit time and possible risk of
burning memory, in exchange for faster sorting.

Actually sorting (during collection) already gives you the docBase so
shouldn't your app already have the context needed for this?

{quote}
The idea of this is to disable per segment searching?
I don't actually want to do that. I want to use per segment searching functionality to take advantage of caches on per segment basis where possible, and map docs to the IndexSearcher context when i can't do per segment caching.
{quote}

OK

{quote}
bq. Could you compute the top-level ords, but then break it up per-segment?

I think i see what your getting at here, and i've already thought of this as a potential solution. The cache will always need to be created at the top most level, but it will be pre-broken out into a per-segment cache whose context is the top level IndexSearcher/MultiReader. The biggest problem here is the complexity of actually creating such a cache, which i'm sure will translate to this cache loading slower (hard to say how much slower without implementing)
I do plan to try this approach, but i expect this will be at least a week or two out from now.

I've currently updated my code for this to work per-segment by adding the docBase when performing the lookup into this cache (which is per-IndexSearcher)
I did this using my getIndexReaderBase() funciton i added to my subclass of IndexSearcher during Scorer construction time (I can live with this, however i would like to see getIndexReaderBase() added to IndexSearcher, and the IndexSearcher passed to Weight.scorer() so i don't need to hold onto my IndexSearcher subclass in my Weight implementation)
{quote}

OK sounds like an at least workable solution.

{quote}
bq. just return the "virtual" per-segment DocIdSet.

Thats what i'm doing now. I use the docid base for the IndexReader, along with its maxDoc to have the Scorer represent a virtual slice for just the segment in question
The only real problem here is that during Scorer initialization for this i have to call fullDocIdSetIter.advance(docBase) in the Scorer constructor. If advance(int) for the DocIdSet in question is O(N), this adds an extra penalty per segment that did not exist before
{quote}

Hmm... is advance in fact costly for your DocIdSets?

{quote}
bq. This isn't a long-term solution, since the order in which Lucene visits the readers isn't in general guaranteed,

that's where IndexSearcher.getIndexReaderBase(IndexReader) comes into play. If you call this in your scorer to get the docBase, it doesn't matter what order the segments are searched in (as it'll always return the proper base (in the context of the IndexSearcher that is))
{quote}

I think adding that one method for 2.9 would make sense?  (Marking it
expert, subject to change).  Because... assuming your app is OK w/
somehow (privately, external to Lucene) having access to the top
IndexSearcher via it's custom Weight, this one method would allow you
to not have to subclass IndexSearcher.


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744793#action_12744793 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

My current plan of attack for this use case will be to:
* pull the cache using the MultiReader at createWeight() time (index into cache will be MultiReader docid)
* pull the base offset for the IndexReader at scorer() creation time (will need to add the getIndexReaderBase() method to my searcher to do so)
* when the scorer needs to hit the cache, it'll add the base to the scorer's docid to get the key for the cache lookup

I should be able to do this easily enough with a customized IndexSearcher (subclass)

there are use cases where documents from one segment need to be aware of documents from other segments
sorting is such a use case (this is just done at the Collector level, so there are more hooks to do the needed base offset stuff)
duplicate removal is another such use case (only return the first document for docs sharing a field value)

both these use cases can be done at the Collector level, however Duplicate Removal could potentially be done at the Query level in order to perform duplicate removal at any location in the query matching
also, efficient duplicate removal for a String field would require the int[] ord index in order to reduce overall memory requirements
Using the int[] ord index allows using a BitSet for the hash set required to mark if a document for a specified value has been encountered (would need a HashSet<String> otherwise (ugh))

my particular use case must be done at the query level in order to have full boolean query support, and the ability to layer multiple queries with all combinations of AND/OR/NOT, and any other query operators, and sadly i have yet to come up with any way to create a cache on a per segment level (without creating the cache at the MultiReader level)


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744833#action_12744833 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. The goal is to move all caches to the segment level in Lucene - we don't want to encourage users to cache per multi-reader by providing API help to do so.
I agree that this is the goal, and that using per segment caches should be the encouraged route for field caching needs. 
I plan to update the vast majority of the caches i use to be loaded on a per segment basis once i switch to 2.9 to take advantage of this.
But it should still be possible for advanced users to do caching on the multireader level. This may require porting upon subsequent versions of lucene (as i'm seeing i will have to for 2.9), however this should remain possible

bq. If you need index wide stats, you use the Weight.
I'm currently using weight to get this cache on the multireader level, however with 2.9 i will have to jump through some more hoops in order to be able to use this cache on each sub reader's scorer

bq. You are trying to use the internal ids externally
All my usage of "internal" docids occurs inside Weight, Scorer, and HitCollector implementations. I don't see how this is really "external" as it is using published interfaces. Its just that the interpretation of these interfaces changed for 2.9 (i have no problem with this as long as i can port from 2.4 with minimal to moderate effort). The reason they were able to change was only because no implementations provided by vanilla lucene or in contrib required the "whollistic" view of the index

bq. The FieldCache is the caching mechanism that Lucene supports with internal ids - and it supports it per segment.
The FieldCache mechanism did not meet all my needs with regards to schema/retention policy/etc, so i have been doing caching in my own code base for quite some time. While the FieldCache usage should be encouraged, it should not be required of advanced users. It should be acceptable for advanced users to feel some pain on upgrading, but there should be a rather clear path for doing so (without a loss of functionality, and ideally without requiring custom patches on top of a released version of lucene)

bq. Sorting is internal.
While sorting is provided by lucene APIs, there is nothing (and should be nothing) stopping someone from performing sorting on their own terms via the Collector interface and their own priority queues/API



> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746840#action_12746840 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

{quote}
this is a trade off.
slower cache loading in order to get faster sorting
i plan to provide the ability to do both, and allow specific use cases to decide what is best for them{quote}

Yes a trade off :) I meant it was bug slow though - as in it may take 80 seconds to load it when it should have taken 5. Unless the index was optimized. Could be fixed though - dropped off my radar now that we don't do it internally anymore.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744847#action_12744847 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

The "internal" vs "external" is kind of confusing made up terms - my fault really.

When I think of using the ids 'internally' I'm thinking that you are taking the index reader and making no assumptions. You just use the single reader and its id space. You can use those ids to get values, and you can map from those ids to values.

The assumption being made here is that you can load up ords for every doc and that these ords will be comparable in a way that every document id across the whole index maps to the same ord if it has the same value for a field. Nothing in the API promised that to my knowledge - it just happened to be a happy side effect. 

bq. While sorting is provided by lucene APIs, there is nothing (and should be nothing) stopping someone from performing sorting on their own terms via the Collector interface and their own priority queues/API
 
Indeed - just like there is nothing stopping you from continuing to use a MultiReader for this functionality.

What I mean by sorting is internal is that we specifically support comparing ords/values across readers. I think we would prefer that you don't count on ids coming from the top reader or a sub reader in other cases. We don't promise one way or another. We just give a reader and say work with this reader.

Experts can generally jump around that if they need to - Solr does a bit of this - or you can choose to continue using Multi-Readers.

I'm not saying we should make it impossible for you to do this - but I don't think we should open a path for scorers to reconstruct multi-reader virtual ids. I don't think a Scorer should know or care why type of IndexReader it is working with.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746613#action_12746613 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. thats a tough bunch of code to decide to spread ...
at least it'll be able to go away real soon with 3.0

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746635#action_12746635 ] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

bq. one used a cached DocIdSet created over the top level MultiReader (should be able to have a DocIdSet per Segment reader here, but this will take some more thinking (source of the matching docids is from a separate index), will also need to know which sub docidset to use based on which IndexReader is passed to scorer() - shouldn't be any big deal)

I think, similarly, you could continue to create the top-level
DocIdSet, but then make a new DocIdSet that presents one segment's
"slice" out of this top-level DocIdSet.  Then, pre-build the mapping
of IndexReader -> docBase like above, then when scorer() is called in
your custom query, just return the "virtual" per-segment DocIdSet.
Would this work?

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744771#action_12744771 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

You should pull that info from a top level reader and somehow pass it to your scorer through your query impl or something than.

We are working very hard to make search per segment and discourage non per segment use - it doesn't seem like you want to be consulting the entire index on every call to scorer. Passing the base around does not help you with looking at the whole index either - just in terms of doc ids - which we don't support externally - and you are essentially caching them externally. The document ids should be purely internal - I don't think we want to support documents in one segment being aware of docs in another segment either.

createWeight on Query gives you the opportunity to grab stuff your Scorer may need top level (off the Searcher).

You could also make a new Query each time that takes a top level IndexReader and uses it.

You can also override IndexSearcher (as you have) and work around things there.



> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745985#action_12745985 ] 

Mark Miller edited comment on LUCENE-1821 at 8/21/09 7:14 AM:
--------------------------------------------------------------

I certainly think IndexSearcher makes a lot more sense than Searchable or Searcher there - it somewhat handles the whole API break thing - your clearly limiting to an IndexSearcher, so its compatible with the current API and can be clearly explained with javadoc - I still worry about pushing the API towards things that leave MultiSearcher/Remote out in the cold on features. They are technically first class citizens. I think some expert warnings could sell me anyway though.

I still have a problem with this though:

{code}
public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    final Weight weight = query.weight(new IndexSearcher(reader));
    return new DocIdSet() {
      public DocIdSetIterator iterator() throws IOException {
        return weight.scorer(reader, docBase?, true, false);
      }
    };
  }

{code}

That scorer call should get the right IndexSearcher and I don't see how it can without breaking back compat on this method and passing an IndexSearcher too.

* edit *

And even if we fix this here, what about outside code doing the same thing? They won't get the right IndexSearcher.

      was (Author: markrmiller@gmail.com):
    I certainly think IndexSearcher makes a lot more sense than Searchable or Searcher there - it somewhat handles the whole API break thing - your clearly limiting to an IndexSearcher, so its compatible with the current API and can be clearly explained with javadoc - I still worry about pushing the API towards things that leave MultiSearcher/Remote out in the cold on features. They are technically first class citizens. I think some expert warnings could sell me anyway though.

I still have a problem with this though:

{code}
public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    final Weight weight = query.weight(new IndexSearcher(reader));
    return new DocIdSet() {
      public DocIdSetIterator iterator() throws IOException {
        return weight.scorer(reader, docBase?, true, false);
      }
    };
  }

{code}

That scorer call should get the right IndexSearcher and I don't see how it can without breaking back compat on this method and passing an IndexSearcher too.
  
> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744721#action_12744721 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I'm caching an "index" into the terms for the document (Think of the "StringIndex" without the String[])
all i care about is that integer index value

breaking this out into a per segment cache would then require me to also store along with the int[] index array a String[] 
this then gets really slow and nasty when trying to run through the required algorithms to use the cache for my application

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Tim Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746842#action_12746842 ] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

I allow caches to be loaded at commit time (if configured), and recommend that frequently used caches be configured to be loaded at this time
this can result in slower commit times, but responsive queries as soon as the commit is finished

once i also add the option for per-segment caching for sorting and faceting (i'll probably put this on by default for sorting, faceting maybe not), this will allow full tunability for the end-user

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746607#action_12746607 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

You want to weigh in again Mike ? You still have the same stance as your last comment?

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745735#action_12745735 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

I'm still not a fan of giving access to the upper readers.

I think I could go for having the offset available with the appropriate warnings.

I tried this out, and after adjusting all scorer, explains to carry the offset as well, I ended up with one spot left:

{code}
  public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    final Weight weight = query.weight(new IndexSearcher(reader));
    return new DocIdSet() {
      public DocIdSetIterator iterator() throws IOException {
        return weight.scorer(reader, docBase?, true, false);
      }
    };
  }
{code}

Trouble - in these cases, how do you pass the doc base? Its too much breakage to pass it with the reader *everywhere*. You almost want a class that holds the reader ref and the docBase, but you still break apis all over. You could deprecate everything, but then you can't count on getting a good offset (would have to guess 0? ).

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746608#action_12746608 ] 

Mark Miller commented on LUCENE-1821:
-------------------------------------

bq. well, you could go the route similar to the 2.4 TokenStream api (next() vs next(Token))

thats a tough bunch of code to decide to spread ...

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per segment basis, there is no way for a Scorer to know the "actual" doc id for the document's it matches (only the relative doc offset into the segment)
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the "real" docid
> suggest having Weight.scorer() method also take a integer for the doc offset
> Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset
> All Weights that have "sub" weights must pass this offset down to created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: more efficient implementation can be done if you cache the result if gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org