Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2012/03/13 16:34:42 UTC

[jira] [Created] (SOLR-3240) add spellcheck 'approximate collation count' mode

add spellcheck 'approximate collation count' mode
-------------------------------------------------

                 Key: SOLR-3240
                 URL: https://issues.apache.org/jira/browse/SOLR-3240
             Project: Solr
          Issue Type: Improvement
          Components: spellchecker
            Reporter: Robert Muir


SpellCheck's Collation in Solr is a way to ensure spellcheck/suggestions
will actually net results (taking into account context like filtering).

In order to do this (from my understanding), it generates candidate queries,
executes them, and saves the total hit count: collation.setHits(hits).

For a large index it seems this might be doing too much work: in particular
I'm interested in ensuring this feature can work fast enough/well for autosuggesters.

So I think we should offer an 'approximate' mode that uses an early-terminating
Collector, collect()ing only n docs (e.g. n=1), and approximates the result
count based on docid space.

I'm not sure what needs to happen on the Solr side (possibly support for custom collectors?),
but I think this could help and should possibly be the default.
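
A minimal sketch of the idea described above, in plain Java with no Lucene or Solr classes. The class and method names are hypothetical, docids are assumed to be 0-based and visited in increasing order, and treating maxCollect=0 as "no limit" is an assumption of this sketch:

```java
import java.util.PrimitiveIterator;

// Hypothetical stand-in for the proposed collector; plain Java, no Lucene or Solr classes.
final class ApproxCollationHits {

    /**
     * Walks matching docids in increasing order, stops once maxCollect hits
     * have been seen, and extrapolates a total from the share of docid space scanned.
     *
     * @param matches    matching docids, 0-based and increasing
     * @param maxDoc     number of docs in the index
     * @param maxCollect hits to collect before stopping; 0 means no limit (exact count)
     */
    static long estimateHits(PrimitiveIterator.OfInt matches, int maxDoc, int maxCollect) {
        int collected = 0;
        int lastDoc = -1;
        while (matches.hasNext()) {
            lastDoc = matches.nextInt();
            collected++;
            if (maxCollect > 0 && collected == maxCollect) {
                break; // early termination: skip the rest of the index
            }
        }
        if (maxCollect <= 0 || collected < maxCollect) {
            return collected; // ran to the end, so the count is exact
        }
        long scanned = lastDoc + 1L; // docs 0..lastDoc were covered
        return (long) collected * maxDoc / scanned;
    }
}
```

For instance, ApproxCollationHits.estimateHits(java.util.stream.IntStream.of(99, 5000, 19000).iterator(), 20000, 1) stops after the first hit and returns 200, while the same call with maxCollect=0 walks every match and returns the exact count of 3.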


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228504#comment-13228504 ] 

Robert Muir commented on SOLR-3240:
-----------------------------------

{quote}
Beyond this, there are also some dead-simple optimizations we can make by simply removing any sorting & boosting parameters from the query before testing the collation.
{quote}

Right, though with a custom collector we effectively get this too: we wouldn't be sorting or scoring anything.
                

[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "James Dyer (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228492#comment-13228492 ] 

James Dyer commented on SOLR-3240:
----------------------------------

collation.hits is just metadata for the user, so I think what you want to do would be entirely valid.  

The estimates would only be good if the hits are somewhat evenly distributed across the index, right?  For instance, if you're indexing documents by topic and then a bunch of new docs get added on the same topic around the same time, you'd get a cluster of hits in one place.
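
To make the clustering concern concrete, here is a small illustration in plain Java (not Solr code; all names and the synthetic docid layouts are made up for this sketch). It compares the n=1 estimate for 200 hits spread evenly through a 20,000-doc index against 200 hits packed at the end of the index:

```java
import java.util.stream.IntStream;

// Illustration only: plain Java, not Solr code. The names here are made up for this sketch.
final class ClusteredHitsDemo {

    // Estimate from the first hit alone (n=1): index size divided by docs scanned.
    static long estimateFromFirstHit(int[] sortedMatches, int maxDoc) {
        if (sortedMatches.length == 0) {
            return 0;
        }
        return maxDoc / (sortedMatches[0] + 1L);
    }

    // 200 hits spread evenly: one at the end of every block of 100 docs.
    static int[] evenHits() {
        return IntStream.range(0, 200).map(i -> i * 100 + 99).toArray();
    }

    // 200 hits from a batch of same-topic docs added last: docids 19800..19999.
    static int[] clusteredHits() {
        return IntStream.range(19_800, 20_000).toArray();
    }
}
```

Both layouts contain exactly 200 hits. With maxDoc=20000, the evenly spread case estimates 200, but the clustered case estimates 1, because the first hit only turns up after nearly the whole docid space has been scanned.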

Even so, like you say, many (most) people would rather improve performance than have an accurate (any) hit count returned.

Beyond this, there are also some dead-simple optimizations we can make by simply removing any sorting & boosting parameters from the query before testing the collation.
                

[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228469#comment-13228469 ] 

Robert Muir commented on SOLR-3240:
-----------------------------------

Yes, but I'm saying that we can also still approximate the hit count.

For example, for n=1, if you have 20,000 docs and the first matching docid is 100, we estimate there are 200 matching docs.
You can increase n (the max # of collected docs) to increase accuracy at the cost of performance.
Currently n=infinity and it's always exact :)
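
The arithmetic in that example can be written down as a tiny helper. This is only a sketch: the class and parameter names are hypothetical, and it reads the example's docid 100 as "100 docs scanned", which is what makes the numbers come out to 200:

```java
// Hypothetical helper, not part of Solr: the arithmetic from the example above.
final class DocidSpaceEstimate {

    /**
     * @param n           number of docs collected before stopping
     * @param docsScanned how far into the index the n-th hit was found
     *                    (the example's docid 100 is read as 100 docs scanned)
     * @param maxDoc      total docs in the index
     */
    static long estimate(int n, int docsScanned, int maxDoc) {
        if (n <= 0 || docsScanned <= 0) {
            return 0; // nothing collected, nothing to extrapolate from
        }
        return (long) n * maxDoc / docsScanned;
    }
}
```

estimate(1, 100, 20000) gives the 200 from the example. If n were raised to 4 and the fourth hit were found after 500 docs, estimate(4, 500, 20000) would give 160: the result now rests on four observed hits instead of one, at the cost of scanning further.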

James, can you tell me how collation.hits is used? Does collation use this directly as a heuristic for re-ranking suggestions?
Or is it only metadata supplied to the user?

The idea here is that exact numbers are probably not needed for most use cases: most users would probably rather have
inexact hit counts but faster performance.
                

[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "James Dyer (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228463#comment-13228463 ] 

James Dyer commented on SOLR-3240:
----------------------------------

Are you saying that if a user only cares that a collation will yield some hits, but doesn't care how many, then we can short-circuit these queries to quit once one document is collected?  (Alternatively, quit after n docs are collected if the user doesn't care whether it is "greater than n"?)
                

[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293099#comment-13293099 ] 

Robert Muir commented on SOLR-3240:
-----------------------------------

I think so! 

We could optimize it further by ensuring that we aren't collecting scores or anything, e.g. I think we should be wrapping something like a TotalHitCountCollector (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java?view=markup), or just not wrapping any collector at all?

But this patch is probably a good improvement for the worst case.
                

[jira] [Updated] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "James Dyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-3240:
-----------------------------

    Attachment: SOLR-3240.patch

Ok.  I think I have a version here that will never compute scores, without having to write a lot of special code for it.

Best I can tell, when "collateMaxCollectDocs" is 0 or not specified, it will use the first inner-class Collector in SolrIndexSearcher#getDocListNC (this one is almost identical to TotalHitCountCollector).  Otherwise, it will use OneComparatorNonScoringCollector with the sort being on "<id>".  These queries will also make use of the Solr filter cache & query result cache when they can, etc.

The unit tests make it easy to check that it gives the estimate you'd expect.  What I can't so easily test is whether it still picks a non-scoring collector when I turn off hit reporting entirely (collateExtendedResults=false).  I would like to add a test for this but am not sure what the least-invasive approach would be.

I'm also thinking I can safely get rid of the "forceInorderCollection" flag because requesting docs sorted by doc-id would enforce the same thing, right?
                

[jira] [Updated] (SOLR-3240) add spellcheck 'approximate collation count' mode

Posted by "James Dyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-3240:
-----------------------------

    Attachment: SOLR-3240.patch

Here's a patch for this one.

Robert, is this something like what you had in mind when you opened this issue?
                