Posted to solr-user@lucene.apache.org by Tobias Dittrich <di...@wave-computer.de> on 2009/03/02 14:32:05 UTC
Compound word search (maybe DisMaxQueryPaser problem)
Hi all,
I know there are a lot of topics about compound word search
already but I haven't found anything for my specific problem
yet. So if this is already answered (which would be nice :))
then any hints or search phrases for the mail archive would
be appreciated.
Basically I want users to be able to search my index for
compound words that are not really compounds but merely
terms that can be written in several ways.
For example I have the categories "usb" and "cable" in my
index and I want the user to be able to search for
"usbcable" or "usb-cable" etc. Also there is "bluetooth" in
the index and I want the search for "blue tooth" to return
the corresponding documents.
My approach is to use ShingleFilterFactory followed by
WordDelimiterFilterFactory to index all possible
combinations of words and get rid of intra-word delimiters.
This nicely covers the first part of my requirements since
the terms "usb" and "cable" somewhere along the process get
concatenated and "usbcable" is in the index.
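To make the indexing idea concrete, here is a minimal sketch in plain Python (not Solr code) of what the shingle-plus-delimiter-removal chain is meant to achieve. The function names `shingles` and `index_terms` are made up for illustration, and the real ShingleFilterFactory and WordDelimiterFilterFactory behave differently in detail depending on their configuration; this only models the "adjacent words get concatenated" effect described above.

```python
import re

def shingles(tokens, max_size=2):
    """Return all runs of 1..max_size adjacent tokens, joined by a space."""
    out = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out

def index_terms(text, max_size=2):
    """Lowercase, split on whitespace, shingle, then drop intra-word delimiters."""
    tokens = text.lower().split()
    terms = set()
    for shingle in shingles(tokens, max_size):
        # Remove spaces, hyphens and other non-alphanumerics so that
        # "usb cable" and "usb-cable" both collapse to "usbcable".
        terms.add(re.sub(r"[^0-9a-z]+", "", shingle))
    terms.discard("")
    return terms

print(sorted(index_terms("USB cable")))
print(sorted(index_terms("usb-cable")))
```

Both spellings end up producing the term "usbcable", which is why the index side of the problem is covered.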
Now I also want to use this on the query side, so the user
input "blue tooth" (not as phrase) would become "bluetooth"
for this field and produce a hit. But this never happens
since with the DisMax Searcher the parser produces a query
like this:
((category:blue | name:blue)~0.1 (category:tooth |
name:tooth)~0.1)
And the filters and analysers for this field never get to
see the whole user query and cannot perform their shingle
and delimiter tasks :(
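The effect can be reproduced with a small Python model of the query shown above. This is only a sketch under the assumption that the query string is split on whitespace before any field analyzer runs; `dismax_query` is a hypothetical helper, not part of Solr, and it ignores boosts, phrase fields and minimum-should-match.

```python
def dismax_query(user_query, fields, tie=0.1):
    """Build a debug-style string: one DisjunctionMax clause per whitespace chunk."""
    clauses = []
    for chunk in user_query.split():
        # Each field only ever sees a single chunk, so a shingle filter
        # in that field's analyzer has nothing to combine.
        per_field = " | ".join(f"{field}:{chunk}" for field in fields)
        clauses.append(f"({per_field})~{tie}")
    return "(" + " ".join(clauses) + ")"

print(dismax_query("blue tooth", ["category", "name"]))
```

Because "blue" and "tooth" are handed to the analyzers one at a time, no stage ever has both tokens in hand to form "bluetooth".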
So my question now is: how can I get this working? Is there
a preferable way to deal with this compound word problem? Is
there another query parser that already does the trick?
Or would it make sense to write my own query parser that
passes the user query "as is" to the several fields?
Any hints on this are welcome.
Thanks in advance
Tobias
--
Tobias Dittrich
- Head of Internet Development -
_________________________________
WAVE Computersysteme GmbH
Philipp-Reis-Str. 9
35440 Linden
Managing Director: Carsten Kellmann
Register court: Gießen HRB 1823
Phone: +49 (0) 6403 / 9050 6001
Fax: +49 (0) 6403 / 9050 5089
mailto:dittrich@wave-computer.de
http://www.wave-computer.de
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Tobias Dittrich <di...@wave-computer.de>.
Oh my ... thinking even more about it, I have to admit
you're right :) But that leaves me somewhat clueless again.
So I'll just try and share my thoughts on this. Maybe
someone will read this and can point me to a possible
solution ... or tell me where I'm wrong.
Say we have a schema with fields f1, f2 and f3. And the user
queries for "a b c" (without the quotes). What I would
expect as resulting query would be (leaving out the details
like tie, boosting etc.):
((f1:a OR f2:a OR f3:a) AND (f1:b OR f2:b OR f3:b) AND (f1:c
OR f2:c OR f3:c))
OR
((f1:ab OR f2:ab OR f3:ab) AND (f1:c OR f2:c OR f3:c))
OR
((f1:a OR f2:a OR f3:a) AND (f1:bc OR f2:bc OR f3:bc))
(possibly also f1:abc OR f2:abc .. and/or f1:a b c OR f2:a
b c etc. )
So every possibility of how to write compound words is covered.
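The expansion sketched above can be generated mechanically. The following Python sketch enumerates every way of gluing adjacent query words together and builds the corresponding boolean query string. The helpers `segmentations` and `expanded_query` are hypothetical names used only for illustration; no existing Solr query parser is claimed to work this way.

```python
from itertools import product

def segmentations(tokens):
    """Yield every way to glue adjacent tokens together, keeping their order."""
    n = len(tokens)
    if n == 0:
        return
    # Each of the n-1 gaps is either a word boundary (True) or glued (False).
    for gaps in product([True, False], repeat=n - 1):
        words = [tokens[0]]
        for token, is_boundary in zip(tokens[1:], gaps):
            if is_boundary:
                words.append(token)
            else:
                words[-1] += token
        yield words

def expanded_query(user_query, fields):
    """OR together one AND-of-ORs clause per segmentation of the user query."""
    alternatives = []
    for words in segmentations(user_query.split()):
        groups = ["(" + " OR ".join(f"{f}:{w}" for f in fields) + ")" for w in words]
        alternatives.append("(" + " AND ".join(groups) + ")")
    return " OR ".join(alternatives)

print(expanded_query("a b c", ["f1", "f2", "f3"]))
```

Note that the number of alternatives doubles with every additional query word (2^(n-1) for n words), so a real implementation would probably want to cap the shingle size.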
But then there is the problem that some fields require
exact matching (something like EAN, manufacturer code or
product serial number). Unfortunately these can contain
whitespace etc., so a b c can also be a valid
manufacturer code which should match as a whole.
So I modeled the fields in the schema accordingly: the
exact-match fields are of type string, and the content
fields get ShingleFilter and WordDelimiterFilter. I thought
each field's analyzer stack would take care of how to
process the user input.
But when I pass the user query as phrase to the DisMax
Handler (so that every field gets to see the whole user
query and can tokenize and shingle it) I get a query like this:
(f1:a b c)
OR
(f2:a OR f2:ab OR f2:b OR f2:bc OR f2:c)
OR
(f3:a OR f3:ab OR f3:b OR f3:bc OR f3:c)
which is clearly not what I need, as it would also find,
for example, documents that only contain a or b etc. When
using phrase fields this query is just added to the normal
query and therefore the query fails to find the compound words.
Also using the FieldQuery Analyzer does not yield the
desired results, as the parsed queries in fact look like
the phrase queries from the DisMax parser.
I tried dozens of variations and I'm still pretty sure that
there must be a way to get this working. It doesn't look
that hard. But for now I will let this rest for the weekend :)
Have a nice weekend all and thanks in advance for any
comments or replies.
Tobi
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Chris Hostetter <ho...@fucit.org>.
: Many thanks for your explanation. That really helped me a lot in understanding
: DisMax - and finally I realized that DisMax is not at all what I need.
: Actually I do not want results where "blue" is in one field and "tooth" in
: another (imagine you search for a notebook with blue tooth and get some blue
: products that accidentally have tooth in some field).
except that if you use the "pf" param as well, a search for...
blue tooth
can score products where "blue tooth" appears in one field higher than
products where "blue" appears in one field and "tooth" appears in another
field.
The approach you are describing might give you better precision (i.e.
fewer total results) but it will come with a loss in recall; a query like
this...
blue tooth notebook
...probably won't be able to find documents matching the terms
"product_type:notebook features:blue features:tooth" ... but dismax
can.
-Hoss
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Tobias Dittrich <di...@wave-computer.de>.
Many thanks for your explanation. That really helped me a
lot in understanding DisMax - and finally I realized that
DisMax is not at all what I need. Actually I do not want
results where "blue" is in one field and "tooth" in another
(imagine you search for a notebook with blue tooth and get
some blue products that accidentally have tooth in some field).
My feeling already was that I have to come up with my own
solution mixing parts of DisMax (distribute the query among
the fields) and FieldQParserPlugin. So now I will try that out.
Many thanks
Tobi
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Chris Hostetter <ho...@fucit.org>.
: My original assumption for the DisMax Handler was, that it will just take the
: original query string and pass it to every field in its fieldlist using the
: field's configured analyzer stack. Maybe in the end add some stuff for the
: special options and so ... and then send the query to Lucene. Can you explain
: why this approach was not chosen?
because then it wouldn't be the DisMaxRequestHandler.
seriously: the point of dismax is to build up a DisjunctionMaxQuery for
each "chunk" in the query string and populate those DisjunctionMaxQueries
with the Queries produced by analyzing that "chunk" against each field in
the qf -- then all of the DisjunctionMaxQueries are grouped into a
BooleanQuery with a minNrShouldMatch.
if you look at the query toString from debugQuery (using a non trivial qf
param and a q string containing more than one "chunk") you can see what i
mean. your example shows it pretty well actually...
: ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
the point is to build those DisjunctionMaxQueries -- so that each "chunk"
only contributes significantly based on the highest scoring field that
chunk appears in ... in your example someone typing "blue tooth" can get a
match when a doc matches blue in one field and tooth in another -- that
wouldn't be possible with the approach you describe. the Query structure
also means that a doc where "tooth" appears in both the category and name
fields but "blue" doesn't appear at all won't score as high as a doc that
matches "blue" in category and "tooth" in name (although you have to look
at the score explanations to really see what i mean by that)
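A rough way to see the scoring argument without reading score explanations is the simplified Python model below. It assumes each DisjunctionMaxQuery scores as the best matching field plus the tie value times the remaining fields, and that the enclosing BooleanQuery simply sums its clauses; real Lucene scoring adds further factors (norms, coord and so on) that are ignored here. The per-field scores of 1.0 and the helper names are invented for illustration.

```python
def dismax_score(field_scores, tie=0.1):
    """Score of one DisjunctionMaxQuery: best field plus tie * the other fields."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

def doc_score(doc_matches, chunks, fields, tie=0.1):
    """Sum the per-chunk dismax scores; doc_matches maps (field, chunk) -> score."""
    total = 0.0
    for chunk in chunks:
        scores = [doc_matches[(f, chunk)] for f in fields if (f, chunk) in doc_matches]
        total += dismax_score(scores, tie)
    return total

fields = ["category", "name"]
chunks = ["blue", "tooth"]
# Hypothetical documents with made-up per-field scores of 1.0.
tooth_twice = {("category", "tooth"): 1.0, ("name", "tooth"): 1.0}
blue_and_tooth = {("category", "blue"): 1.0, ("name", "tooth"): 1.0}
print(doc_score(tooth_twice, chunks, fields))
print(doc_score(blue_and_tooth, chunks, fields))
```

In this model the second document scores higher: matching "tooth" a second time only adds the small tie contribution, while matching a second chunk adds a full clause score.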
There are certainly a lot of improvements that could be made to dismax ...
more customization in terms of how the query string is parsed before
building up the DisjunctionMaxQueries and calling the individual field
analyzers would certainly be one way it could improve ... but so far no
one has attempted anything like that.
-Hoss
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Tobias Dittrich <di...@wave-computer.de>.
First of all: sorry Chris, Walter .. I did not mean to put
pressure on anyone. It's just that if you're stuck with
something and you have that little needle stinging saying:
maybe you're just too damn stupid for this ... :) So, thanks
a lot for your answers.
As for index time expansion using synonyms: I think this is
not an option for me since it would mean that I have to a)
find all such words that might cause problems and b) find
every variant that might possibly be used by customers. And
then in the end I have to keep all my synonym files
up-to-date. But the main design goal for my search
implementation is little to no maintenance.
My original assumption for the DisMax Handler was, that it
will just take the original query string and pass it to
every field in its fieldlist using the field's configured
analyzer stack. Maybe in the end add some stuff for the
special options and so ... and then send the query to
Lucene. Can you explain why this approach was not chosen?
Thanks
Tobi
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Chris Hostetter <ho...@fucit.org>.
: Hmmm was my mail so weird or my question so stupid ... or is there simply
: no one with an answer? Not even a hint? :(
patience my friend, i've got a backlog of ~500 Lucene related messages in
my INBOX, and i was just reading your original email when this reply came
in.
In general this is a fairly hard problem ... the easiest solution i know
of that works in most cases is to do index time expansion using the
SynonymFilter, so regardless of whether a document contains "usbcable"
"usb-cable" or "usb cable" all three variants get indexed, and then the
user can search for any of them.
the downside is that it can throw off your tf/idf stats for some terms (if
they appear by themselves, and as part of a compound) and it can result in
false positives for esoteric phrase searches (but that tends to be more of
a theoretical problem than an actual one).
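As a rough picture of the index-time expansion described above, the Python sketch below maps any known spelling of a term to all of its variants, so that every variant ends up in the index. It is only a toy model: `VARIANT_GROUPS` and `expand` are made-up names, and the real SynonymFilter works on token streams driven by a synonyms file rather than on whole phrases.

```python
# Hand-maintained groups of equivalent spellings (illustrative data only).
VARIANT_GROUPS = [
    {"usbcable", "usb-cable", "usb cable"},
]

def expand(phrase, groups=VARIANT_GROUPS):
    """Return every spelling to index for a phrase (itself if no group matches)."""
    key = phrase.lower()
    for group in groups:
        if key in group:
            return sorted(group)
    return [key]

print(expand("USB-Cable"))
print(expand("notebook"))
```

The need to maintain those groups by hand is the cost of this approach.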
: > But this never happens since with the DisMax Searcher the parser produces a
: > query like this:
: >
: > ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
...
: > to deal with this compound word problem? Is there another query parser that
: > already does the trick?
take a look at the FieldQParserPlugin ... it passes the raw query string
to the analyser of a specified field -- this would let your TokenFilters
see the "stream" of tokens (which isn't possible with the conventional
QueryParser tokenization rules) but it doesn't have any of the
"field/query matrix cross product" goodness of dismax -- you'd only be
able to query the one field.
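For trying this out, the field parser is normally selected with the local-params prefix {!field f=...} on the q parameter. The Python sketch below only builds such a request URL; the host, port and path are placeholders for a local test instance, "name" is one of the fields from the example earlier in the thread, and `field_parser_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def field_parser_url(base_url, field, user_query):
    """Build a select URL whose q uses the {!field f=...} local-params prefix."""
    params = {
        "q": "{!field f=" + field + "}" + user_query,
        "debugQuery": "true",  # lets you inspect the parsed query
    }
    return base_url + "?" + urlencode(params)

# The base URL is an assumption; adjust it to your own installation.
print(field_parser_url("http://localhost:8983/solr/select", "name", "blue tooth"))
```

With debugQuery enabled, the response shows the query that the field's analyzer produced from the whole input string.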
(Hmmm.... i wonder if DisMaxQParser 2.0 could have an option to let you
specify a FieldType whose analyzer was used to tokenize the query string
instead of using the Lucene QueryParser JavaCC tokenization, and *then*
the tokens resulting from that initial analyzer could be passed to the
analyzers of the various qf fields ... hmmm, that might be just crazy
enough to be too crazy to work)
-Hoss
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Walter Underwood <wu...@netflix.com>.
Sorry, I missed this. We have the same problem.
None of our customers use query syntax, so I have considered making a
full-text query parser. Use the analyzer chain, then convert the result
into a big OR query, then pass it to the rest of Dismax. Shingles and
synonyms should work at query time with that approach.
This question should probably go to a Lucene list, too.
wunder
On 3/11/09 2:54 AM, "Tobias Dittrich" <di...@wave-computer.de> wrote:
> Hmmm was my mail so weird or my question so stupid ... or is
> there simply no one with an answer? Not even a hint? :(
Re: Compound word search (maybe DisMaxQueryPaser problem)
Posted by Tobias Dittrich <di...@wave-computer.de>.
Hmmm was my mail so weird or my question so stupid ... or is
there simply no one with an answer? Not even a hint? :(