Posted to dev@lucene.apache.org by John Berryman <jf...@gmail.com> on 2012/06/11 05:03:40 UTC

Issues with whitespace tokenization in QueryParser

According to https://issues.apache.org/jira/browse/LUCENE-2605, the Lucene
QueryParser tokenizes on white space before giving any text to the
Analyzer. This makes it impossible to use multi-term synonyms because the
SynonymFilter only receives one word at a time.

A resolution to this would really help with my current project. My
client sells clothing and accessories online. They have plenty of
examples of compound words, e.g. "rain coat". But some of these compound
words are really tripping them up. A prime example is that a search for
"dress shoes" returns a list of dresses and random shoes (not
necessarily dress shoes). I wish I could map compound words to single
tokens (e.g. "dress shoes => dress_shoes"), but with this whitespace
tokenization issue, it's impossible.
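
For concreteness, here's roughly what I have in mind, as a sketch of a
Solr-style setup (the field type name and file names are just
illustrative):

  # synonyms.txt
  dress shoes => dress_shoes
  rain coat => rain_coat

  <!-- query-time analyzer in schema.xml -->
  <fieldType name="text_syn" class="solr.TextField">
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

But because the query parser splits on whitespace before analysis, the
SynonymFilter only ever sees "dress" and "shoes" as separate one-word
streams, so the multi-word rule never fires.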

Has anything happened with this bug recently? For a short time I've got a
client that would be willing to pay for this issue to be fixed if it's not
too much of a rabbit hole. Anyone care to catch me up on what this might
entail?

-- 
LinkedIn <http://www.linkedin.com/pub/john-berryman/13/b17/864>
Twitter <http://twitter.com/#!/jnbrymn>

Re: Issues with whitespace tokenization in QueryParser

Posted by Chris Hostetter <ho...@fucit.org>.
: 
: NOTE: I definitely don't want to discourage you from tackling this
: issue, but I think it's fair to mention there is a workaround, and
: that's if you can preprocess your queries yourself (maybe you don't
: allow all the Lucene syntax to your users or something like that), you
: can escape the whitespace yourself, such as rain\ coat, and I think
: your synonyms will work as expected.

Alternatively: use a QueryParser that doesn't know/care about any special 
markup and just analyzes the entire input against a single (configured) 
field and generates the appropriate query -- Solr's "FieldQParser" works 
this way for example.
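
For example (just a sketch; the field name is illustrative), the field 
QParser can be invoked with local params syntax:

  q={!field f=description}dress shoes

The entire string "dress shoes" is analyzed against the "description" 
field in one shot, and if analysis produces multiple tokens you 
typically get a phrase query.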

You have to pick a tradeoff between "I want to support query operators 
like ':', '+', '-', and ' ' that let me build up BooleanQuery objects and 
query specific fields" vs "I want the entire query string analyzed as one 
chunk".

: > really tripping them up. A prime example is that a search for "dress shoes"
: > returns a list of dresses and random shoes (not necessarily dress shoes). I
: > wish I could map compound words to single tokens (e.g. "dress
: > shoes => dress_shoes"), but with this whitespace tokenization issue, it's
: > impossible.

This is one of the main use cases of the DismaxQParser (and now the 
EDismaxQParser as well) with the "pf" param in Solr ... you can have it 
query for "dress" and/or "shoes" in some set of fields (qf), but also 
for the entire phrase "dress shoes" in a distinct set of fields (pf), 
which get a higher score.
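
A rough sketch of the request params (field names and boosts are purely 
illustrative):

  defType=dismax
  q=dress shoes
  qf=name description
  pf=name^10 description^5

The qf fields match the individual terms, while the pf fields add a 
boost for documents where "dress shoes" occurs as a phrase.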

http://wiki.apache.org/solr/DisMax
http://wiki.apache.org/solr/DisMaxQParserPlugin
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/



-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Issues with whitespace tokenization in QueryParser

Posted by Robert Muir <rc...@gmail.com>.
Welcome John!

Basically, the tricky part about this issue is how the Analyzer integrates
into the parsing workflow: it is as hossman says on the issue.

You can edit the .jflex file so that _TERM_CHAR is defined differently
and regenerate, and you will see what I mean from the tests that fail.

The crux of the problem is that currently, if you have +foo bar -baz,
we split on whitespace, apply the operators, and then run the analyzer
on each portion: you get +foo, bar, and -baz, and then we analyze foo,
bar, and baz respectively.

But if you just remove the whitespace tokenization, you will get +foo
bar, -baz, which is different.
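
To make that concrete, here is a minimal sketch of the current behavior 
against the classic QueryParser (assuming Lucene 3.x; the field name "f" 
and the analyzer choice are arbitrary):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  QueryParser qp = new QueryParser(Version.LUCENE_36, "f",
      new StandardAnalyzer(Version.LUCENE_36));
  Query q = qp.parse("+foo bar -baz"); // parse() may throw ParseException
  System.out.println(q);
  // prints: +f:foo f:bar -f:baz
  // foo, bar, and baz were each analyzed separately, so a multi-word
  // SynonymFilter never sees "foo bar" together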

So to make this kind of thing work as expected, I think the analyzer
would need to be integrated at an earlier stage, before the operators
are applied, e.g. as part of the lexing process.

NOTE: I definitely don't want to discourage you from tackling this
issue, but I think it's fair to mention there is a workaround, and
that's if you can preprocess your queries yourself (maybe you don't
allow all the Lucene syntax to your users or something like that), you
can escape the whitespace yourself, such as rain\ coat, and I think
your synonyms will work as expected.
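
Continuing the sketch above (again, purely illustrative), escaping the 
space changes what the analyzer receives:

  Query q2 = qp.parse("rain\\ coat"); // user-visible query: rain\ coat
  // the parser now hands "rain coat" to the analyzer as a single
  // chunk, so a multi-word rule like "rain coat => rain_coat" can fire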

On Sun, Jun 10, 2012 at 11:03 PM, John Berryman <jf...@gmail.com> wrote:
> According to https://issues.apache.org/jira/browse/LUCENE-2605, the Lucene
> QueryParser tokenizes on white space before giving any text to the Analyzer.
> This makes it impossible to use multi-term synonyms because the
> SynonymFilter only receives one word at a time.
>
> A resolution to this would really help with my current project. My client
> sells clothing and accessories online. They have plenty of examples
> of compound words, e.g. "rain coat". But some of these compound words are
> really tripping them up. A prime example is that a search for "dress shoes"
> returns a list of dresses and random shoes (not necessarily dress shoes). I
> wish I could map compound words to single tokens (e.g. "dress
> shoes => dress_shoes"), but with this whitespace tokenization issue, it's
> impossible.
>
> Has anything happened with this bug recently? For a short time I've got a
> client that would be willing to pay for this issue to be fixed if it's not
> too much of a rabbit hole. Anyone care to catch me up on what this might
> entail?
>
> --
> LinkedIn
> Twitter
>



-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org