Posted to solr-user@lucene.apache.org by Eric Jain <Er...@isb-sib.ch> on 2006/06/25 18:11:27 UTC

Splitting and matching words

I'd like to have "PowerShot", "powershot" and "power-shot" match each 
other. Solr has a WordDelimiterFilter, which works quite well, except that 
"powershot" still won't match "PowerShot" (tokenized into "power (shot 
powershot)", so "power powershot" would match...). Any suggestions?

Re: Splitting and matching words

Posted by Chris Hostetter <ho...@fucit.org>.

: perhaps span queries could avoid generating all the possibilities...

I remember coming up with a design for dealing with cases like this a while
back ... it did involve using SpanNear/SpanOr queries -- but it also
required added information in the Tokens at query time to resolve the
"lap/0 top/1 notebook/1" ambiguity.

I'll see if I can dig that up (not sure if I ever typed it up, or if it
was just a whiteboard thing that got erased when I never did anything with
it).



-Hoss

Re: Splitting and matching words

Posted by Yonik Seeley <ys...@gmail.com>.

On 6/25/06, Yonik Seeley <ys...@gmail.com> wrote:
>   1) a new QueryParser smart enough to make a boolean query instead of
> a MultiPhraseQuery.   "Power Shot" OR "PowerShot"

Thinking about this option a bit more...
The problem is ambiguity.  Sometimes a MultiPhraseQuery is the correct
interpretation and sometimes a boolean query is needed.  The same
problem exists on the query side for multi-token synonyms.  There
isn't enough information about what the "synonyms" actually are.

Take the case of lap/0 top/1 notebook/1  (where /0 and /1 are token positions).
There isn't enough info to understand if notebook is a synonym for
"top" or for "lap top".
Even if we added extra info (I recently committed a Lucene patch to
allow subclassing Token), it's not an easy problem.
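
To make the ambiguity concrete, here is a minimal Python sketch. It is a toy model, not Lucene's synonym handling: the helper apply_synonym is hypothetical, and placing the synonym at the position of the last word it stands for is an assumption chosen to mirror the lap/0 top/1 notebook/1 example. Two different synonym rules end up producing the same token stream, so the stream alone cannot say which rule was meant.

```python
def apply_synonym(words, phrase, synonym):
    """Return (term, position) tokens for `words`, adding `synonym` at the
    position of the last word of every occurrence of `phrase`.
    (Assumed placement, chosen to match the example in the thread.)"""
    tokens = [(word, pos) for pos, word in enumerate(words)]
    n = len(phrase)
    for start in range(len(words) - n + 1):
        if words[start:start + n] == phrase:
            tokens.append((synonym, start + n - 1))
    # sorted() is stable, so original words stay ahead of injected synonyms
    return sorted(tokens, key=lambda token: token[1])


words = ["lap", "top"]

# Rule A: "notebook" is a synonym for the single word "top"
single = apply_synonym(words, ["top"], "notebook")

# Rule B: "notebook" is a synonym for the two-word phrase "lap top"
multi = apply_synonym(words, ["lap", "top"], "notebook")

print(single)
print(multi)
print("identical streams:", single == multi)
```

Under rule A the intended query is "lap" followed by ("top" or "notebook"); under rule B it is "lap top" OR "notebook". Both start from the same positions.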

Consider something like "my PowerShot lap-top", and trying to
represent that with a boolean query of phrase queries... you need all
the possibilities.

"my Power Shot lap top"
"my PowerShot lap top"
"my Power Shot laptop"
"my PowerShot laptop"

perhaps span queries could avoid generating all the possibilities...
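
A rough Python sketch of that enumeration, for illustration only: each word with a split and a catenated form becomes a slot with two alternatives, and the boolean query needs one phrase per combination. The slot list below is written by hand and the function name phrase_variants is hypothetical; nothing here calls Solr or Lucene.

```python
from itertools import product


def phrase_variants(slots):
    """Build every phrase that picks one alternative from each slot."""
    return [" ".join(choice) for choice in product(*slots)]


# "my PowerShot lap-top": two of the three words have two readings each
slots = [
    ["my"],
    ["Power Shot", "PowerShot"],
    ["lap top", "laptop"],
]

variants = phrase_variants(slots)
boolean_query = " OR ".join('"%s"' % phrase for phrase in variants)

for phrase in variants:
    print(phrase)
print(boolean_query)
```

With n ambiguous words the number of phrases is 2**n, which is why generating all of them does not scale.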

-Yonik

Re: Splitting and matching words

Posted by Yonik Seeley <ys...@gmail.com>.

On 6/25/06, Eric Jain <Er...@isb-sib.ch> wrote:
> I'd like to have "PowerShot", "powershot" and "power-shot" match each
> other. Solr has a WordDelimiterFilter, which works quite well, except that
> "powershot" still won't match "PowerShot" (tokenized into "power (shot
> powershot)", so "power powershot" would match...). Any suggestions?

You mean if the indexed text was "powershot" and the query text was
"PowerShot" then it wouldn't match (but the reverse case will).

That is a problem... if one does both catenation and splitting on the
query side, you end up with "Power" in the first position, and both
"Shot" and "PowerShot" in the second.  While this works fine for the
indexing side, on the query side it's interpreted as a
MultiPhraseQuery meaning "Power" followed by either "Shot" or
"PowerShot".
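
The position layout described above can be sketched in a few lines of Python. This is a simplified stand-in for WordDelimiterFilter, not its real implementation: the functions split_subwords and word_delimiter are hypothetical, and the splitting rules (non-alphanumeric delimiters and lower-to-upper case changes) cover only the cases discussed in this thread.

```python
import re


def split_subwords(token):
    """Split on non-alphanumeric delimiters and on case changes."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # An uppercase run plus trailing lowercase/digits, or a lowercase run
        parts.extend(re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", chunk))
    return parts


def word_delimiter(token, catenate=True):
    """Return (term, position) pairs: the subwords at consecutive positions,
    plus the catenated form at the position of the last subword."""
    parts = split_subwords(token)
    tokens = [(part, pos) for pos, part in enumerate(parts)]
    if catenate and len(parts) > 1:
        tokens.append(("".join(parts), len(parts) - 1))
    return tokens


for text in ["PowerShot", "power-shot", "powershot"]:
    print(text, "->", word_delimiter(text))
```

For "PowerShot" this gives "Power" at position 0 and both "Shot" and "PowerShot" at position 1, while "powershot" stays a single token at position 0. A query built from the first stream requires a term at position 0 followed by one at position 1, which a one-token document cannot satisfy.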

Workarounds:
  1) a new QueryParser smart enough to make a boolean query instead of
a MultiPhraseQuery.   "Power Shot" OR "PowerShot"
  2) index the field a second time via copyField, but have the query
analyzer catenate instead of split subwords.  query across both
fields.
  3) do more client-side processing... change "PowerShot" to
"PowerShot" OR "powershot" (i.e. create a boolean query with the
second option removing subword delimiters yourself).

(1) is much harder to do in a generic way, but would be most useful.
(2) is much easier and can be done now.
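
As an illustration of workaround (3), a client could rewrite each term before sending it to Solr. The sketch below is an assumption about how that might look, not code from Solr: expand_term is a hypothetical helper, and lowercasing the catenated form follows the "powershot" spelling used in the example above.

```python
import re


def expand_term(term):
    """Return the term OR'ed with a copy that has subword delimiters
    removed and is lowercased (assumed normalization)."""
    catenated = re.sub(r"[^A-Za-z0-9]+", "", term).lower()
    if catenated == term:
        # Nothing to add: the term is already in catenated form
        return '"%s"' % term
    return '"%s" OR "%s"' % (term, catenated)


for term in ["PowerShot", "power-shot", "powershot"]:
    print(term, "->", expand_term(term))
```

The client has to know which terms are worth expanding, which is the cost of doing this outside the analyzer.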

-Yonik

Re: Splitting and matching words

Posted by Eric Jain <Er...@isb-sib.ch>.

Eric Jain wrote:
> I'd like to have "PowerShot", "powershot" and "power-shot" match each 
> other. Solr has a WordDelimiterFilter, which works quite well, except 
> that "powershot" still won't match "PowerShot" (tokenized into "power 
> (shot powershot)", so "power powershot" would match...). Any suggestions?

The workaround I'll probably use for the time being is to lowercase the 
tokens before applying the WordDelimiterFilter, in the analyzer that is 
used for parsing queries (but for indexing the order remains unchanged).

This way matches are case-insensitive, which is essential for our 
application. "power-shot" (query) still won't match "powershot" (index), 
but all the other combinations should work.
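
The combinations can be checked with a small Python simulation. It is a toy model under stated assumptions, not Solr itself: word_delimiter is a simplified stand-in for WordDelimiterFilter that both splits and catenates, the index analyzer is assumed to lowercase after splitting, and phrase_matches imitates how a MultiPhraseQuery treats several terms at one position as alternatives. All function names are hypothetical.

```python
import re


def word_delimiter(token):
    """Toy splitter: subwords at consecutive positions, catenated form
    at the position of the last subword."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        parts.extend(re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", chunk))
    tokens = [(part, pos) for pos, part in enumerate(parts)]
    if len(parts) > 1:
        tokens.append(("".join(parts), len(parts) - 1))
    return tokens


def index_analyze(text):
    # Index side: split on case changes first, lowercase afterwards
    return [(term.lower(), pos) for term, pos in word_delimiter(text)]


def query_analyze(text):
    # Query side: lowercase first, so case changes no longer split
    return word_delimiter(text.lower())


def phrase_matches(query_tokens, index_tokens):
    """Every query position needs one of its alternative terms at the
    corresponding index position, for some starting offset."""
    alternatives = {}
    for term, pos in query_tokens:
        alternatives.setdefault(pos, set()).add(term)
    indexed = set(index_tokens)
    starts = {pos for _, pos in index_tokens}
    return any(
        all(
            any((term, start + qpos) in indexed for term in terms)
            for qpos, terms in alternatives.items()
        )
        for start in starts
    )


forms = ["PowerShot", "powershot", "power-shot"]
results = {
    (query, doc): phrase_matches(query_analyze(query), index_analyze(doc))
    for query in forms
    for doc in forms
}
failures = [pair for pair, matched in results.items() if not matched]

for (query, doc), matched in results.items():
    print("query=%-10s index=%-10s match=%s" % (query, doc, matched))
```

In this model the only failing pair is the query "power-shot" against the indexed text "powershot".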