Posted to dev@lucene.apache.org by "Robert Muir (Commented) (JIRA)" <ji...@apache.org> on 2012/03/05 18:25:59 UTC

[jira] [Commented] (SOLR-2660) omitPositions improvements

    [ https://issues.apache.org/jira/browse/SOLR-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222461#comment-13222461 ] 

Robert Muir commented on SOLR-2660:
-----------------------------------

I think this could be a good option (in combination with shingles, as mentioned) to accelerate
the phrase queries that Solr query parsers generate in order to boost closer matches.

Again, the idea is to omit positions entirely and instead use ShingleFilter (unigrams and bigrams), approximating phrase
queries with n-gram conjunctions. For the sloppy case, I think we should use an n-gram disjunction, perhaps interpreting
the slop factor as minNrShouldMatch?
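
A minimal sketch of the rewrite described above, in plain Python rather than Lucene APIs. The function names are hypothetical, the underscore separator follows the foo_bar notation in the issue description, and the slop mapping (each unit of slop forgives one missing bigram) is only one possible answer to the open question, not something the patch implements.

```python
def bigrams(terms):
    """Join adjacent terms, mimicking bigram shingles with a '_' separator."""
    return [f"{a}_{b}" for a, b in zip(terms, terms[1:])]


def approximate_phrase(terms, slop=0):
    """Return (clauses, min_should_match) approximating a phrase query.

    slop == 0: conjunction, every bigram must match.
    slop > 0:  disjunction with a minimum-should-match threshold.
    """
    if len(terms) < 2:
        # A single term has no bigrams; it passes through unchanged.
        return list(terms), len(terms)
    clauses = bigrams(terms)
    if slop == 0:
        return clauses, len(clauses)
    # ASSUMPTION: each unit of slop lets one bigram go unmatched,
    # but at least one bigram must always match.
    return clauses, max(1, len(clauses) - slop)


print(approximate_phrase(["foo", "bar", "baz"]))
print(approximate_phrase(["foo", "bar", "baz"], slop=1))
```

With slop 0 the threshold equals the number of clauses, which is the same as requiring all of them (the conjunction case).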

This basically means you are substituting an n-gram approximation for Levenshtein distance in both cases.

In general it's a classic indexing/search tradeoff: in my tests on Wikipedia, indexing takes roughly twice as long with the shingles,
but in exchange, for a lot of these use cases you don't need to consult the positions file at all.

As a parameter to the fieldtype it's easily pluggable without messing with any queryparsers, and ordinary queries (term, boolean, etc.)
are totally 'pass-thru'. *However*, the thing I guess I don't like about this patch is that this is really a different
'query intent'. In other words, I think it's a perfect approach when you just want to boost the scores of close matches
(e.g. when generated by the dismax queryparser), but when your 'intent' is to actually limit matches to a phrase
(e.g. when keyed in by a user directly), this approximation isn't as good a fit.
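
To make the 'intent' concern concrete, here is a small sketch (plain Python, hypothetical helper names, not Lucene code) of a document that satisfies the bigram conjunction for "foo bar baz" without containing that phrase. It assumes whitespace tokenization and '_'-joined bigrams.

```python
def shingles(tokens):
    """Index-time view: the set of unigrams plus '_'-joined bigrams."""
    grams = {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | grams


def approx_phrase_match(doc_tokens, phrase_tokens):
    """Position-free approximation: every bigram of the phrase is indexed."""
    indexed = shingles(doc_tokens)
    needed = [f"{a}_{b}" for a, b in zip(phrase_tokens, phrase_tokens[1:])]
    # A single-term phrase has no bigrams, so fall back to the term itself.
    return all(g in indexed for g in (needed or phrase_tokens))


def exact_phrase_match(doc_tokens, phrase_tokens):
    """Position-aware check: the phrase occurs as a contiguous run."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


doc = "foo bar qux bar baz".split()
phrase = "foo bar baz".split()
print(approx_phrase_match(doc, phrase))  # True: foo_bar and bar_baz both occur
print(exact_phrase_match(doc, phrase))   # False: the three terms are never adjacent
```

For score boosting, a false positive like this only adds a little noise to the ranking; for a user-entered phrase it returns a document the user explicitly excluded.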

Either way, I'm open to other opinions before doing anything (if we decide to do it, I think the next step is to update the patch with
the SloppyPhraseQuery approximation).

                
> omitPositions improvements
> --------------------------
>
>                 Key: SOLR-2660
>                 URL: https://issues.apache.org/jira/browse/SOLR-2660
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 3.3, 4.0
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: SOLR-2660.patch
>
>
> Follow-up to LUCENE-2048:
> Adds factory methods getPhraseQuery/getMultiPhraseQuery to QP so that you can subclass it and customize behavior. In particular:
> * By default, Solr throws an exception here if the fieldtype omits positions, rather than 3.x's silent failure of no results; even for trunk it's nicer to fail during query parsing than to wait for Lucene's failure during execution.
> * Adds phraseAsBoolean, which allows you to downgrade these phrase/multiphrase queries to boolean queries. This is a nice option in conjunction with our word n-gram filters (shingle/commongrams/etc.) for a fast "approximation", if your application can tolerate some false positives, e.g. "foo bar" -> termQuery(foo_bar), "foo bar baz" -> BQ(foo_bar AND bar_baz)
