Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/06/27 03:03:04 UTC

[jira] [Created] (LUCENE-6624) provide a BookendFilter to make the "exact match against an entire (tokenized) field value" use case easy

Hoss Man created LUCENE-6624:
--------------------------------

             Summary: provide a BookendFilter to make the "exact match against an entire (tokenized) field value" use case easy
                 Key: LUCENE-6624
                 URL: https://issues.apache.org/jira/browse/LUCENE-6624
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Hoss Man


A question that seems to pop up every now and then is how to require an "exact match" against "an entire field value" even while using some sort of analysis feature (i.e. stopwords, lowercasing, or whitespace normalization).

In other words: instead of a literal, byte-for-byte "exact match" (e.g. {{new StringField(f, val, Store.NO)}} at index time; {{new TermQuery(new Term(f, val))}} at query time), some folks want to use some Tokenizer and TokenFilter chain, but then require that a {{PhraseQuery}} (or {{SpanNearQuery}}) on the input match the entire field value, without any terms left over.
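
For reference, a minimal sketch of those two approaches side by side (the field names, values, and the {{analyzer}} instance here are just illustrative placeholders):

{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.QueryBuilder;

Document doc = new Document();

// literal, byte-for-byte exact match: the whole value is indexed as a single term
doc.add(new StringField("title", "The Quick Brown Dog.", Store.NO));
Query literal = new TermQuery(new Term("title", "The Quick Brown Dog."));

// analyzed match: the value is tokenized, and the query is built with the same analyzer
doc.add(new TextField("title_txt", "The Quick Brown Dog.", Store.NO));
Query phrase = new QueryBuilder(analyzer).createPhraseQuery("title_txt", "quick BROWN dog");
{code}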

Example: they want (phrase) queries like {{"The Quick Brown Dog"}} and {{"quick BROWN dog"}} to both match a document indexed with a field value of "{{The Quick Brown Dog.}}", because their analyzer tokenizes both the query & the field value into {{quick | brown | dog}} (standard tokenizer + stopword & lowercase filters) -- BUT -- on the other hand, they don't want either of those phrase queries to match a document with a field value of "{{I Love the Quick Brown Dog}}", because that field value includes additional terms not covered by the query.
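
One way to get that {{quick | brown | dog}} analysis, roughly (this assumes a Lucene 5.x style chain; the user's actual components may differ):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream sink = new LowerCaseFilter(source);
    sink = new StopFilter(sink, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, sink);
  }
};
// both "The Quick Brown Dog." and "quick BROWN dog" come out as: quick | brown | dog
{code}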


A suggestion I've seen for years in response to this type of question is that folks can "inject marker tokens" at the beginning and end of both the field values & query, and then (as long as there is no "slop" on the phrase queries) they should get the matches they expect.  The hackish way to do this is to just prepend and append some strings that won't be found in their data and won't be stripped out by their tokenizer or any token filters (e.g. {{new TextField(f, "VAL_START_XYZABC " + val + " VAL_END_XYZABC", Store.NO)}} at index time; {{queryBuilder.createPhraseQuery(f, "VAL_START_XYZABC " + val + " VAL_END_XYZABC")}} at query time).
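
Spelled out as a sketch ({{val}}, {{userInput}}, and {{analyzer}} are placeholders for whatever the user already has):

{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;

String start = "VAL_START_XYZABC";
String end = "VAL_END_XYZABC";

// index time: glue the markers onto the raw value before it hits the analyzer
Document doc = new Document();
doc.add(new TextField("title", start + " " + val + " " + end, Store.NO));

// query time: wrap the user input the same way; with no slop, the markers
// must line up with the very first and very last positions of the field
Query q = new QueryBuilder(analyzer).createPhraseQuery("title", start + " " + userInput + " " + end);
{code}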


Unless I'm missing something, it should be fairly trivial to write a "BookendFilter" that does this automatically for users (a rough sketch follows the list below):

* the first time {{incrementToken()}} is called, produce a synthetic "start" token whose CharTermAttribute uses a non-printing Unicode sequence (overridable by user config)
* after that, all calls to {{incrementToken()}} proxy to the wrapped stream until it's exhausted
* after that, when {{incrementToken()}} is called, produce a synthetic "end" token whose CharTermAttribute uses a non-printing Unicode sequence (overridable by user config)
* both synthetic tokens should have KeywordAttribute == true
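
Here's a rough, untested sketch of what such a filter might look like (the marker strings are constructor args standing in for the configurable non-printing defaults; position increment & offset bookkeeping and the {{end()}} contract are hand-waved):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class BookendFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
  private final String startMarker;
  private final String endMarker;
  private boolean started;
  private boolean ended;

  public BookendFilter(TokenStream in, String startMarker, String endMarker) {
    super(in);
    this.startMarker = startMarker;
    this.endMarker = endMarker;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!started) {
      // first call: emit the synthetic "start" token
      started = true;
      clearAttributes();
      termAtt.setEmpty().append(startMarker);
      keywordAtt.setKeyword(true);
      return true;
    }
    if (!ended) {
      if (input.incrementToken()) {
        return true; // proxy the wrapped stream until it is exhausted
      }
      // wrapped stream exhausted: emit the synthetic "end" token
      ended = true;
      clearAttributes();
      termAtt.setEmpty().append(endMarker);
      keywordAtt.setKeyword(true);
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    started = false;
    ended = false;
  }
}
{code}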

...At index time the synthetic tokens will be indexed as terms, and if the same analyzer is used at query time to build a PhraseQuery, those terms will be the first and last terms in the PhraseQuery.
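
A hypothetical end-to-end wiring, reusing the analysis chain from the earlier example (the private-use characters {{\uE000}} / {{\uE001}} are just placeholder markers):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;

Analyzer bookended = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream sink = new LowerCaseFilter(source);
    sink = new StopFilter(sink, StandardAnalyzer.STOP_WORDS_SET);
    sink = new BookendFilter(sink, "\uE000", "\uE001");
    return new TokenStreamComponents(source, sink);
  }
};

// use the same analyzer both for indexing the TextField and for building the
// query, and the bookend terms become the first and last terms of the phrase
Query q = new QueryBuilder(bookended).createPhraseQuery("title", "quick BROWN dog");
{code}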




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org