Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/06/27 03:04:05 UTC

[jira] [Commented] (LUCENE-6624) provide a BookendFilter to make the "exact match against an entire (tokenized) field value" usecase easy

    [ https://issues.apache.org/jira/browse/LUCENE-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603864#comment-14603864 ] 

Hoss Man commented on LUCENE-6624:
----------------------------------

(creating in response to a conversation i had earlier today that made me realize we still don't offer anything out of the box to really address this type of problem ... probably won't have time to get to it soon, but i wanted to file the issue with the overall gist of the goal/idea so it's actually written down somewhere)

> provide a BookendFilter to make the "exact match against an entire (tokenized) field value" usecase easy
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6624
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6624
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> A question that seems to pop up every now and then is how to require an "exact match" against "an entire field value" even while using some sort of analysis feature (ie: stopwords, or lowercasing, or whitespace normalization).
> In other words: instead of a literal, byte for byte, "exact match" (eg: {{new StringField(f, val, Store.NO)}} at index time; {{new TermQuery(new Term(f, val))}} at query time) some folks want to use some Tokenizer and TokenFilter but then require that a "PhraseQuery" (or SpanNearQuery) on the input matches the entire field value, w/o any terms left over.
> Example: they want (phrase) queries like {{"The Quick Brown Dog"}} and {{"quick BROWN dog"}} to both match a document indexed with a field value "{{The Quick Brown Dog.}}" because their analyzer tokenizes both the query & the field value into {{quick | brown | dog}} (standard tokenizer + stopword & lowercase filters) -- BUT -- on the other hand they don't want either of those phrase queries to match a document with a field value of "{{I Love the Quick Brown Dog}}" because that field value includes additional terms not covered by the query.
> A suggestion i've seen for years in response to this type of question is that folks can "inject marker tokens" at the beginning and end of both the field values & query, and then (as long as there is no "slop" on the phrase queries) they should get the matches they expect.  The hackish way to do this is to just prepend and append some strings that won't be found in their data and won't be stripped out by their tokenizer or any token filters (eg: {{new TextField(f, "VAL_START_XYZABC " + val + " VAL_END_XYZABC", Store.NO)}} at index time; {{queryBuilder.createPhraseQuery(f, "VAL_START_XYZABC " + val + " VAL_END_XYZABC")}} at query time).
> Unless i'm missing something, it should be fairly trivial to write a "BookendFilter" that does this automatically for users:
> * the first time {{incrementToken()}} is called, produce a synthetic "start" token with some CharTermAttribute that uses a non-printing unicode sequence (overridable by user config)
> * after that, all calls to {{incrementToken()}} proxy to the wrapped stream until it's exhausted
> * after that, when {{incrementToken()}} is called, produce a synthetic "end" token with some CharTermAttribute that uses a non-printing unicode sequence (overridable by user config)
> * both synthetic tokens should have KeywordAttribute == true
> ...At index time the synthetic tokens will be indexed as terms, and if the same analyzer is used at query time to build a PhraseQuery those terms will be the first and last terms in the PhraseQuery.
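
To make the idea concrete, here is a minimal, library-free sketch of the bookend behaviour described in the issue. It is not a Lucene TokenFilter and makes no claim about how a real BookendFilter would be written: it only models "emit start marker, proxy the wrapped tokens, emit end marker" with a plain Iterator, plus a toy analyzer and a slop-free phrase check. Every name in it (BookendSketch, BookendIterator, analyze, bookend, phraseMatches) is hypothetical, the marker code points U+FDD0 and U+FDD1 are just an assumed choice of non-printing characters, and the single-word stopword list is a stand-in for a real analyzer. It also ignores position increments, offsets and KeywordAttribute, all of which a real filter would have to deal with.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

final class BookendSketch {
    // Assumed default markers (Unicode noncharacters); the issue only asks
    // for "a non-printing unicode sequence" that users can override.
    static final String START = "\uFDD0";
    static final String END = "\uFDD1";

    // Toy stopword list, standing in for a real stopword filter.
    static final Set<String> STOPWORDS = Set.of("the");

    /** Emits START, then every wrapped token, then END. */
    static final class BookendIterator implements Iterator<String> {
        private final Iterator<String> wrapped;
        private boolean startEmitted = false;
        private boolean endEmitted = false;

        BookendIterator(Iterator<String> wrapped) {
            this.wrapped = wrapped;
        }

        @Override
        public boolean hasNext() {
            // There is always something left until END has been produced.
            return !endEmitted;
        }

        @Override
        public String next() {
            if (!startEmitted) {          // first call: synthetic start token
                startEmitted = true;
                return START;
            }
            if (wrapped.hasNext()) {      // then proxy to the wrapped stream
                return wrapped.next();
            }
            if (!endEmitted) {            // wrapped stream exhausted: synthetic end token
                endEmitted = true;
                return END;
            }
            throw new NoSuchElementException();
        }
    }

    /** Toy analyzer: split on non-alphanumerics, lowercase, drop stopwords. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String t = raw.toLowerCase(Locale.ROOT);
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Wraps a token list with the start and end markers. */
    static List<String> bookend(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new BookendIterator(tokens.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    /** Slop-free phrase match: the query must appear as a contiguous run. */
    static boolean phraseMatches(List<String> fieldTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(fieldTokens, queryTokens) >= 0;
    }
}
```

In this model, phraseMatches(bookend(analyze("The Quick Brown Dog.")), bookend(analyze("quick BROWN dog"))) is true, while the same query against bookend(analyze("I Love the Quick Brown Dog")) is false, because the start marker is no longer adjacent to "quick". Without the bookend step the second field value would match, which is exactly the problem the issue describes.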



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
