You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2011/05/15 17:34:47 UTC

[jira] [Created] (LUCENE-3102) Few issues with CachingCollector

Few issues with CachingCollector
--------------------------------

Key: LUCENE-3102
URL: https://issues.apache.org/jira/browse/LUCENE-3102
Project: Lucene - Java
Issue Type: Bug
Components: contrib/*
Reporter: Shai Erera
Priority: Minor
Fix For: 3.2, 4.0

CachingCollector (introduced in LUCENE-1421) has few issues:
# Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
# It does not clear cachedScores + cachedSegs upon exceeding RAM limits
# I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
# This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?

Also:
* The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?

* Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)

* I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?

* How about using OpenBitSet instead of int[] for doc IDs?
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order

* Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.

* I think a set of dedicated unit tests for this class alone would be good.

That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035637#comment-13035637 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

Thanks Shai -- this is awesome progress!

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034283#comment-13034283 ] 

Shai Erera commented on LUCENE-3102:
------------------------------------

Committed revision 1103870 (3x).
Committed revision 1103872 (trunk).

What's committed:
* Move CachingCollector to core
* Fix bugs
* Add TestCachingCollector
* Some refactoring

Moving on to next proposed changes.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/grouping
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033992#comment-13033992 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

Patch looks awesome Shai!

Only thing is: I would be careful about directly setting those private fields of the cachedScorer; I think (not sure) this incurs an "access" check on each assignment.  Maybe make them package protected?  Or use a setter?

bq. Question – perhaps we can commit these changes incrementally?

+1 - progress not perfection!  These changes are great.

{quote}
Mike, there is another reason to separate Collector.needsScores() from cacheScores – it is possible someone will pass a Collector which needs scores, however won't want to have CachingCollector 'cache' them. In which case, the wrapped Collector should be delegated setScorer instead of cachedScorer.
{quote}

Ahh good point, because the 2nd pass collector may not need the scores.  So on the first pass we'd have to forward the .score() request to the wrapped collector but not cache it.

bq. I will leave Collector.needsScores() for a different issue though?

+1

Thanks for iterating on this Shai!

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Component/s:     (was: contrib/*)
                 modules/grouping

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/grouping
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035188#comment-13035188 ] 

Shai Erera commented on LUCENE-3102:
------------------------------------

Committed revision 1104680 (3x).
Committed revision 1104683 (trunk).

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035223#comment-13035223 ] 

Shai Erera commented on LUCENE-3102:
------------------------------------

There are two things left to do:

(1) Use bit set instead of int[] for docIDs. If we do this, then it means the Collector cannot support out-of-order collections (which is not a big deal IMO). It also means for large indexes, we might consume more RAM than int[].

(2) Allow this Collector to stand on its own, w/o necessarily wrapping another Collector. There are several ways we can achieve that:
* Take a 'null' Collector and check other != null. Adds an 'if' but not a big deal IMO. Also, acceptDocsOutOfOrder will have to either return false (or true), or we take that as a parameter.
* Take a 'null' Collector and set this.other to a private static instance of a NoOpCollector. We'll still be delegating calls to it, but hopefully it won't be expensive. Same issue w/ out-of-order
* Create two specialized variants of CachingCollector.

Personally I'm not too much in favor of the last option - too much code dup for not much gain.

The option I like the most is the 2nd (introducing a NoOpCollector). We can even introduce it as a public static member of CachingCollector and let users decide if they want to use it or not. For ease of use, we can still allow 'null' to be passed to create().

What do you think?

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Attachment: LUCENE-3102-nowrap.patch

Patch against 3x:
* Adds a create() to CachingCollector which does not take a Collector to wrap. Internally, it creates a no-op collector, which ignores everything.
* Javadocs for create()
* matching test.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034841#comment-13034841 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

Patch looks great!  But, can we rename curupto -> curUpto (and same for curbase)?  Ie, so it matches the other camelCaseVariables we have here...

Thank you!

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Attachment: LUCENE-3102.patch

bq. Only thing is: I would be careful about directly setting those private fields of the cachedScorer; I think (not sure) this incurs an "access" check on each assignment. Maybe make them package protected? Or use a setter?

Good catch Mike. I read about it some and found this nice webpage which explains the implications (http://www.glenmccl.com/jperf/). Indeed, if the member is private (whether it's in the inner or outer class), there is an access check. So the right think to do is to declare is protected / package-private, which I did. Thanks for the opportunity to get some education !

Patch fixes this. I intend to commit this shortly + move the class to core + apply to trunk. Then, I'll continue w/ the rest of the improvements.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Component/s:     (was: modules/grouping)
                 core/search

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Attachment: LUCENE-3102-factory.patch

Patch against 3x which:

* Adds factory method to CachingCollector, specializing on cacheScores
* Clarify Collector.needScores() TODO

There are two remaining issues, let's address them after we iterate on this patch.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Resolved] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera resolved LUCENE-3102.
--------------------------------

    Resolution: Fixed

Thanks Mike. Seems that TestGrouping is indeed fixed.

Committed revision 1124378 (3x).
Committed revision 1124379 (trunk).

Resolving this. We can tackle OBS and other optimizations in subsequent issues if the need arises.

Thanks Mike !

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035285#comment-13035285 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

Patch to allow no wrapped collector looks good!  I wonder/hope hotspot can realize those method calls are no-ops...

Maybe change TestGrouping to randomly use this ctor?  Ie, randomly, you can use caching collector (not wrapped), then call its replay method twice (once against 1st pass, then against 2nd pass, collectors), and then assert results like normal.  This is also a good verification that replay works twice...

On the OBS, it makes me nervous to just always do this; I'd rather have it cutover at some point?  Or perhaps it's an expert optional arg to create, whether it should back w/ OBS vs int[]?

Or, ideally... we make a bit set impl that does this all under the hood (uses int[] when there are few docs, and "ugprades" to OBS once there are "enough" to justify it...), then we can just use that bit set here.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034091#comment-13034091 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

Patch looks great Shai -- +1 to commit!!

Yes that is very sneaky about the private fields in inner/outer classes -- it's good you added a comment explaining it!

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035344#comment-13035344 ] 

Shai Erera commented on LUCENE-3102:
------------------------------------

bq. The committed CHANGES has typo (reply should be replay).

Thanks, will include it in the next commit.

bq. I'd rather have it cutover at some point

This can only be done if out-of-order collection wasn't done so far, because otherwise, cutting to OBS will take cached doc IDs and scores out of sync.

bq. we make a bit set impl that does this all under the hood (uses int[] when there are few docs, and "ugprades" to OBS once there are "enough" to justify it...)

That's a good idea. I think we should leave the OBS stuff for another issue. See first how this performs and optimize only if needed.

I'll take a look at TestGrouping.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035283#comment-13035283 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

The committed CHANGES has typo (reply should be replay).

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033788#comment-13033788 ] 

Shai Erera commented on LUCENE-3102:
------------------------------------

bq. I will start w/ "svn mv" this to core

Or, we can iterate on all the changes here, then do the svn move as part of the commit. Both work for me.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Attachment: LUCENE-3102-nowrap.patch

Patch adds random to TestGrouping and fixes the CHANGES typo.

Mike, TestGrouping fails w/ this seed: -Dtests.seed=7295196064099074191:-1632255311098421589 (it picks a no wrapping collector).

I guess I didn't insert the random thing properly. It's the only place where the test creates a CachingCollector though. I noticed that it fails on the 'doCache' but '!doAllGroups' case.

Can you please take a look? I'm not familiar with this test, and cannot debug it anymore today.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3102:
---------------------------------------

    Attachment: LUCENE-3102.patch

Patch.

I think I fixed TestGrouping to exercise the "no wrapped collector" and "replay twice" case for CachingCollector.

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
-------------------------------

    Attachment: LUCENE-3102.patch

Patch includes the bug fixes + test. Still none of the items I listed after 'Also ...'. I plan to tackle that next, in subsequent patches.

Question -- perhaps we can commit these changes incrementally? I.e., after we iterate on the changes in this patch, if they are ok, commit them, then do the rest of the stuff? Or a single commit w/ everything is preferable?

Mike, there is another reason to separate Collector.needsScores() from cacheScores -- it is possible someone will pass a Collector which needs scores, however won't want to have CachingCollector 'cache' them. In which case, the wrapped Collector should be delegated setScorer instead of cachedScorer.

I will leave Collector.needsScores() for a different issue though?

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Assigned] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera reassigned LUCENE-3102:
----------------------------------

    Assignee: Shai Erera

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033787#comment-13033787 ] 

Shai Erera commented on LUCENE-3102:
------------------------------------

bq. Hmm, I think it does clear cachedScores? (But not cachedSegs).

Sorry, I meant curScores.

bq. +1 (on move to core)

I will start w/ "svn mv" this to core, so that later patches on this issue will be applied easily. Moving to core has nothing to do w/ resolving the other issues.

bq. we could also "upgrade" to a bit set part way through, since it's so clear where the cutoff is

you're right, but cutting off to OBS is dangerous, b/c by doing that we:
# Suddenly halt search when we create and populate OBS
# Lose the ability to support out-of-order docs (in fact, depending on the mode and how the query was executed so far, we might not even be able to do the cut-off at all).
So I prefer that we make that decision up front, perhaps through another parameter to the factory method.

bq. but eg you could mess up (pass cacheScores = false but then pass a collector that calls .score())

Oh I see, so this TODO is about the use of cachedScorer (vs. just delegating setScorer to other). I agree.

BTW, this version of cachedScorer is very optimized and clean, but we do have ScoreCachingWrappingScorer which achieves the same goal, only w/ 1-2 more 'if'. Perhaps we should reuse it? But then again, for the purpose of this Collector, cachedScorer is the most optimized it can be.

bq. ie, I don't want to make it easy to be unlimited? It seems too dangerous... I'd rather your code has to spell out 10*1024 so you realize you're saying 10 GB (for example).

What if you run w/ 16GB Heap? ;)

But ok, I don't mind, we can spell it out clearly in the jdocs.

bq. Are you planning to work up a patch for these...?

I think so. I'll try to squeeze it in my schedule in the next couple of days. If I see I don't get to it, I'll update the issue.


> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033784#comment-13033784 ] 

Michael McCandless commented on LUCENE-3102:
--------------------------------------------

Great catches Shai -- thanks for the thorough review!

bq. Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.

Ahh... maybe we throw IllegalArgExc if the replay'd collector requires
in-order but the first pass collector didn't?

bq. It does not clear cachedScores + cachedSegs upon exceeding RAM limits

Hmm, I think it does clear cachedScores?  (But not cachedSegs).

bq. I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity

That sounds great!

bq. This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?

Oh you mean for the last "chunk" we could alloc right up to the limit?
Good!

bq. The TODO in line 64 (having Collector specify needsScores()) – why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?

Yes, I think we should keep the explicit boolean (cacheScores), but eg
you could mess up (pass cacheScores = false but then pass a collector
that calls .score()) -- that's why I added to TODO.  Ie, it'd be nice
if we could "verify" that the collector agrees we don't need scores.

I think there were other places in Lucene where knowing this up front
could help us... can't remember the details.

bq. Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)

+1

bq. I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?

+1

{quote}
How about using OpenBitSet instead of int[] for doc IDs?
If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
{quote}

Hmm but if the number of hits is small we spend un-needed RAM/CPU,
but, then that tradeoff is maybe OK?  I'm just worried about indices
w/ lots of docs... we could also "upgrade" to a bit set part way
through, since it's so clear where the cutoff is.

bq. Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector optionally wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.

I'd actually rather not have the constant -- ie, I don't want to make
it easy to be unlimited?  It seems too dangerous... I'd rather your
code has to spell out 10*1024 so you realize you're saying 10 GB (for
example).

bq. I think a set of dedicated unit tests for this class alone would be good.

+1

Awesome feedback!!  Are you planning to work up a patch for these...?


> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org