Posted to solr-user@lucene.apache.org by Tobias Dittrich <di...@wave-computer.de> on 2009/03/02 14:32:05 UTC

Compound word search (maybe DisMaxQueryPaser problem)

Hi all,

I know there are a lot of topics about compound word search 
already but I haven't found anything for my specific problem 
yet. So if this is already answered (which would be nice :)) 
then any hints or search phrases for the mail archive would 
be appreciated.

Basically I want users to be able to search my index for 
compound words that are not really compounds but merely 
terms that can be written in several ways.

For example I have the categories "usb" and "cable" in my 
index and I want the user to be able to search for 
"usbcable" or "usb-cable" etc. Also there is "bluetooth" in 
the index and I want the search for "blue tooth" to return 
the corresponding documents.

My approach is to use ShingleFilterFactory followed by 
WordDelimiterFilterFactory to index all possible 
combinations of words and get rid of intra-word delimiters. 
This nicely covers the first part of my requirements since 
the terms "usb" and "cable" somewhere along the process get 
concatenated and "usbcable" is in the index.
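
To make the index-side idea concrete, here is a minimal sketch in plain Python. It only imitates the effect of shingling followed by delimiter removal; it is not Solr's API, and the real ShingleFilterFactory and WordDelimiterFilterFactory emit more token variants and position information than this. The function names and the shingle size of 2 are assumptions made for illustration.

```python
import re

def shingles(tokens, max_size=2):
    """Return every run of 1..max_size adjacent tokens, joined by a space."""
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out

def strip_delimiters(term):
    """Concatenate the parts of a term by dropping spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", term)

def index_terms(text, max_size=2):
    """Lower-case, split on whitespace, shingle, then remove intra-word delimiters."""
    tokens = text.lower().split()
    return sorted({strip_delimiters(s) for s in shingles(tokens, max_size)})

print(index_terms("USB cable"))   # ['cable', 'usb', 'usbcable']
print(index_terms("usb-cable"))   # ['usbcable']
```

Both spellings end up producing the term "usbcable", which is why the index side already behaves as wanted.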

Now I also want to use this on the query side, so the user 
input "blue tooth" (not as phrase) would become "bluetooth" 
for this field and produce a hit. But this never happens 
since with the DisMax Searcher the parser produces a query 
like this:

((category:blue | name:blue)~0.1 (category:tooth | 
name:tooth)~0.1)

And the filters and analysers for this field never get to 
see the whole user query and cannot perform their shingle 
and delimiter tasks :(
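
The effect can be reproduced with a small Python sketch. It assumes, as the debug output above suggests, that the query string is split on whitespace before any field analyzer runs; the function name and the string rendering are illustrative only and are not the parser's actual code.

```python
def dismax_query(user_query, fields, tie=0.1):
    """Build the string form of a DisMax-style query: one disjunction per whitespace chunk."""
    clauses = []
    for chunk in user_query.split():
        # Each field's analyzer would only ever be handed this single chunk,
        # so a shingle filter has nothing to combine it with.
        per_field = " | ".join(f"{field}:{chunk}" for field in fields)
        clauses.append(f"({per_field})~{tie}")
    return "(" + " ".join(clauses) + ")"

print(dismax_query("blue tooth", ["category", "name"]))
# ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
```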

So my question now is: how can I get this working? Is there 
a preferable way to deal with this compound word problem? Is 
there another query parser that already does the trick?

Or would it make sense to write my own query parser that 
passes the user query "as is" to the several fields?

Any hints on this are welcome.

Thanks in advance
Tobias

-- 
Tobias Dittrich
- Head of Internet Development -
_________________________________
WAVE Computersysteme GmbH

Philipp-Reis-Str. 9
35440 Linden

Managing Director: Carsten Kellmann
Register court: Gießen, HRB 1823

Phone: +49 (0) 6403 / 9050 6001
Fax: +49 (0) 6403 / 9050 5089
mailto:dittrich@wave-computer.de
http://www.wave-computer.de


Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Tobias Dittrich <di...@wave-computer.de>.
Oh my ... thinking even more about it, I have to admit 
you're right :) But that leaves me somewhat clueless again.

So I'll just try and share my thoughts on this. Maybe 
someone will read this and can point me to a possible 
solution ... or tell me where I'm wrong.

Say we have a schema with fields f1, f2 and f3. And the user 
queries for "a b c" (without the quotes). What I would 
expect as resulting query would be (leaving out the details 
like tie, boosting etc.):

((f1:a OR f2:a OR f3:a) AND (f1:b OR f2:b OR f3:b) AND (f1:c 
OR f2:c OR f3:c))
OR
((f1:ab OR f2:ab OR f3:ab) AND (f1:c OR f2:c OR f3:c))
OR
((f1:a OR f2:a OR f3:a) AND (f1:bc OR f2:bc OR f3:bc))

(possibly also f1:abc OR f2:abc ... and/or f1:"a b c" OR 
f2:"a b c" etc.)

So every possibility of how to write compound words is covered.
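
As a rough illustration of that expansion, the following Python sketch enumerates every way of joining adjacent terms and renders the resulting boolean query as a string. It is an assumption about how such a query could be generated, not an existing Solr parser; the function names are made up, and the number of alternatives doubles with each additional term.

```python
from itertools import product

def segmentations(tokens):
    """Yield every way of concatenating adjacent tokens, keeping their order."""
    if not tokens:
        return
    # Each gap between neighbours is either a split (True) or a join (False).
    for gaps in product([True, False], repeat=len(tokens) - 1):
        parts = [tokens[0]]
        for token, split in zip(tokens[1:], gaps):
            if split:
                parts.append(token)
            else:
                parts[-1] += token
        yield parts

def expanded_query(user_query, fields):
    """Render one AND-group per segmentation, OR-ing each part across all fields."""
    alternatives = []
    for parts in segmentations(user_query.split()):
        groups = ["(" + " OR ".join(f"{f}:{p}" for f in fields) + ")" for p in parts]
        alternatives.append("(" + " AND ".join(groups) + ")")
    return "\nOR\n".join(alternatives)

print(expanded_query("a b c", ["f1", "f2", "f3"]))
```

For "a b c" this yields four alternatives: a/b/c, a/bc, ab/c and abc.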

But then there is the problem that some fields require 
exact matching (something like EAN, manufacturer code or 
product serial number). Unfortunately these can contain 
whitespace etc., so "a b c" can also be a valid 
manufacturer code which should match as a whole.

So I modeled the fields in the schema accordingly: making 
exact match fields string fields and adding ShingleFilter 
and WordDelimiterFilter to content fields. And I thought 
each field's analyzer stack would take care of how to 
process the user input.

But when I pass the user query as phrase to the DisMax 
Handler (so that every field gets to see the whole user 
query and can tokenize and shingle it) I get a query like this:
(f1:a b c)
OR
(f2:a OR f2:ab OR f2:b OR f2:bc OR f2:c)
OR
(f3:a OR f3:ab OR f3:b OR f3:bc OR f3:c)

which evidently is not what I need, as it would also find, 
for example, documents that only contain a or b etc. When 
using phrase fields this query is just added to the normal 
query and therefore the query fails to find the compound words.

Also using the FieldQuery Analyzer does not yield the 
desired results as the parsed queries as a matter of fact 
look like the phrase queries from the DisMax parser.

I tried dozens of variations and I'm still pretty sure that 
there must be a way to get this working. It doesn't look 
that hard. But for now I will let this rest for the weekend :)

Have a nice weekend all and thanks in advance for any 
comments or replies.

Tobi




Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Chris Hostetter <ho...@fucit.org>.
: Many thanks for your explanation. That really helped me a lot in understanding
: DisMax - and finally I realized that DisMax is not at all what I need.
: Actually I do not want results where "blue" is in one field and "tooth" in
: another (imagine you search for a notebook with blue tooth and get some blue
: products that accidentally have tooth in some field).

except that if you use the "pf" param as well, a search for...

	blue tooth

can score products where "blue tooth" appears in one field higher than 
products where "blue" appears in one field and "tooth" appears in another 
field.
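
A request using "pf" might be assembled as in the Python sketch below. The parameter names (defType, q, qf, pf, tie, debugQuery) are regular dismax request parameters; the host, port, path and the field names "category" and "name" are assumptions taken from the example in this thread.

```python
from urllib.parse import urlencode

def build_dismax_url(user_query, base="http://localhost:8983/solr/select"):
    """Build a dismax request that also boosts whole-phrase matches via pf."""
    params = {
        "defType": "dismax",
        "q": user_query,
        "qf": "category name",   # fields each query chunk is matched against
        "pf": "category name",   # fields where the whole input is tried as a phrase
        "tie": "0.1",
        "debugQuery": "true",    # shows the parsed query for inspection
    }
    return base + "?" + urlencode(params)

print(build_dismax_url("blue tooth"))
```

Note that pf only adds a scoring boost for documents that already match through qf; it does not make additional documents match.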
   

The approach you are describing might give you better precision (i.e. 
fewer total results) but it will have a loss in recall; a query like 
this...

	blue tooth notebook

...probably won't be able to find documents matching the terms 
"product_type:notebook features:blue features:tooth" ... but dismax 
can.


-Hoss


Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Tobias Dittrich <di...@wave-computer.de>.
Many thanks for your explanation. That really helped me a 
lot in understanding DisMax - and finally I realized that 
DisMax is not at all what I need. Actually I do not want 
results where "blue" is in one field and "tooth" in another 
(imagine you search for a notebook with blue tooth and get 
some blue products that accidentally have tooth in some field).

My feeling already was that I have to come up with my own 
solution mixing parts of DisMax (distribute the query among 
the fields) and FieldQParserPlugin. So now I will try that out.

Many thanks
Tobi



Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Chris Hostetter <ho...@fucit.org>.
: My original assumption for the DisMax Handler was that it will just take the
: original query string and pass it to every field in its field list using the
: field's configured analyzer stack. Maybe in the end add some stuff for the
: special options and so on ... and then send the query to Lucene. Can you explain
: why this approach was not chosen?

because then it wouldn't be the DisMaxRequestHandler.

seriously: the point of dismax is to build up a DisjunctionMaxQuery for 
each "chunk" in the query string and populate those DisjunctionMaxQueries 
with the Queries produced by analyzing that "chunk" against each field in 
the qf -- then all of the DisjunctionMaxQueries are grouped into a 
BooleanQuery with a minNrShouldMatch.

if you look at the query toString from debugQuery (using a non trivial qf 
param and a q string containing more than one "chunk") you can see what i 
mean.  your example shows it pretty well actually...

: ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)

the point is to build those DisjunctionMaxQueries -- so that each "chunk" 
only contributes significantly based on the highest scoring field that 
chunk appears in ... in your example someone typing "blue tooth" can get a 
match when a doc matches blue in one field and tooth in another -- that 
wouldn't be possible with the approach you describe.  the Query structure 
also means that a doc where "tooth" appears in both the category and name 
fields but "blue" doesn't appear at all won't score as high as a doc that 
matches "blue" in category and "tooth" in name (although you have to look 
at the score explanations to really see what i mean by that)
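
The scoring argument can be checked with a few lines of Python. A DisjunctionMaxQuery scores a chunk as the best field score plus the tie breaker times the remaining field scores; the sketch below applies that rule with made-up raw scores of 1.0 per matching field and ignores coord factors and norms, so the absolute numbers are purely illustrative.

```python
def dismax_score(field_scores, tie=0.1):
    """Score of one disjunction: best field plus tie times the other fields."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

def document_score(per_chunk_field_scores, tie=0.1):
    """Sum the per-chunk disjunction scores, roughly as the enclosing BooleanQuery does."""
    return sum(dismax_score(scores, tie) for scores in per_chunk_field_scores)

# Hypothetical raw scores for the query "blue tooth": chunk order is [blue, tooth].
tooth_in_both_fields = document_score([[], [1.0, 1.0]])   # "blue" nowhere, "tooth" in category and name
one_term_per_field = document_score([[1.0], [1.0]])       # "blue" in category, "tooth" in name

print(tooth_in_both_fields, one_term_per_field)
```

With these assumed inputs the document matching both chunks scores clearly higher than the one matching "tooth" twice, because the second "tooth" match only counts at the tie-breaker weight.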


There are certainly a lot of improvements that could be made to dismax ... 
more customization in terms of how the query string is parsed before 
building up the DisjunctionMaxQueries and calling the individual field 
analyzers would certainly be one way it could improve ... but so far no 
one has attempted anything like that.




-Hoss


Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Tobias Dittrich <di...@wave-computer.de>.
First of all: sorry Chris, Walter .. I did not mean to put 
pressure on anyone. It's just that if you're stuck with 
something and you have that little needle stinging saying: 
maybe you're just too damn stupid for this ... :) So, thanks 
a lot for your answers.

As for index time expansion using synonyms: I think this is 
not an option for me since it would mean that I have to a) 
find all such words that might cause problems and b) find 
every variant that might possibly be used by customers. And 
then in the end I have to keep all my synonym files 
up-to-date. But the main design goal for my search 
implementation is little to no maintenance.

My original assumption for the DisMax Handler was that it 
will just take the original query string and pass it to 
every field in its field list using the field's configured 
analyzer stack. Maybe in the end add some stuff for the 
special options and so on ... and then send the query to 
Lucene. Can you explain why this approach was not chosen?

Thanks
Tobi




Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Chris Hostetter <ho...@fucit.org>.
: Hmmm was my mail so weird or my question so stupid ... or is there simply
: no one with an answer? Not even a hint? :(

patience my friend, i've got a backlog of ~500 Lucene related messages in 
my INBOX, and i was just reading your original email when this reply came 
in.

In general this is a fairly hard problem ... the easiest solution i know 
of that works in most cases is to do index time expansion using the 
SynonymFilter, so regardless of whether a document contains "usbcable", 
"usb-cable" or "usb cable" all three variants get indexed, and then the 
user can search for any of them.
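
A toy version of index time expansion, written in plain Python rather than as a Solr configuration: whenever one spelling of a group occurs in the text, all spellings of that group are added to the indexed terms. The variant groups and the simple substring matching are assumptions for illustration; a real SynonymFilter works on the token stream and reads its groups from a synonyms file.

```python
# Hypothetical variant groups; in Solr these would live in a synonyms file.
VARIANT_GROUPS = [
    ["usbcable", "usb-cable", "usb cable"],
    ["bluetooth", "blue tooth"],
]

def expand_at_index_time(text):
    """Return every variant of each group that has at least one spelling in the text."""
    lowered = text.lower()
    terms = set()
    for group in VARIANT_GROUPS:
        if any(variant in lowered for variant in group):
            terms.update(group)
    return terms

print(sorted(expand_at_index_time("USB-Cable 2m")))
# ['usb cable', 'usb-cable', 'usbcable']
```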

the downside is that it can throw off your tf/idf stats for some terms (if 
they appear by themselves, and as part of a compound) and it can result in 
false positives for esoteric phrase searches (but that tends to be more of 
a theoretical problem than an actual one).

: > But this never happens since with the DisMax Searcher the parser produces a
: > query like this:
: > 
: > ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
	...
: > to deal with this compound word problem? Is there another query parser that
: > already does the trick?

take a look at the FieldQParserPlugin ... it passes the raw query string 
to the analyzer of a specified field -- this would let your TokenFilters 
see the "stream" of tokens (which isn't possible with the conventional 
QueryParser tokenization rules) but it doesn't have any of the 
"field/query matrix cross product" goodness of dismax -- you'd only be 
able to query the one field.
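
For reference, the field parser is selected with local params in the q parameter, in the form {!field f=somefield}. The Python sketch below only builds such a request URL; the host, port, path and the field name "name" are assumptions.

```python
from urllib.parse import urlencode

def field_parser_url(field, user_query, base="http://localhost:8983/solr/select"):
    """Build a request that hands the raw query string to one field's analyzer."""
    q = "{!field f=%s}%s" % (field, user_query)
    return base + "?" + urlencode({"q": q})

print(field_parser_url("name", "blue tooth"))
```

Because the whole string "blue tooth" reaches the analyzer of that one field, a shingle filter configured there gets to see both tokens together.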

(Hmmm.... i wonder if DisMaxQParser 2.0 could have an option to let you 
specify a FieldType whose analyzer was used to tokenize the query string 
instead of using the Lucene QueryParser JavaCC tokenization, and *then* 
the tokens resulting from that initial analyzer could be passed to the 
analyzers of the various qf fields ... hmmm, that might be just crazy 
enough to be too crazy to work)




-Hoss

Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Walter Underwood <wu...@netflix.com>.
Sorry, I missed this. We have the same problem.

None of our customers use query syntax, so I have considered making a
full-text query parser. Use the analyzer chain, then convert the result
into a big OR query, then pass it to the rest of Dismax. Shingles and
synonyms should work at query time with that approach.
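
A rough Python sketch of that idea follows. The analyze function stands in for a real analyzer chain (here: lower-casing plus concatenated two-word shingles), and the query is rendered as a string in the style of dismax debug output; both are assumptions for illustration and not an existing parser.

```python
def analyze(text, max_shingle=2):
    """Hypothetical analyzer chain: lower-case, split, and add concatenated shingles."""
    tokens = text.lower().split()
    terms = []
    for i in range(len(tokens)):
        for size in range(1, max_shingle + 1):
            if i + size <= len(tokens):
                term = "".join(tokens[i:i + size])
                if term not in terms:
                    terms.append(term)
    return terms

def full_text_query(user_query, fields, tie=0.1):
    """One disjunction over the fields, each field being a big OR over every analyzed term."""
    terms = analyze(user_query)
    per_field = ["(" + " OR ".join(f"{field}:{t}" for t in terms) + ")" for field in fields]
    return "(" + " | ".join(per_field) + ")~" + str(tie)

print(full_text_query("blue tooth", ["category", "name"]))
# ((category:blue OR category:bluetooth OR category:tooth) | (name:blue OR name:bluetooth OR name:tooth))~0.1
```

The trade-off is that a big OR also matches documents containing only one of the terms, so ranking has to do the work of pushing the compound matches to the top.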

This question should probably go to a Lucene list, too.

wunder

On 3/11/09 2:54 AM, "Tobias Dittrich" <di...@wave-computer.de> wrote:

> Hmmm was my mail so weird or my question so stupid ... or is
> there simply no one with an answer? Not even a hint? :(


Re: Compound word search (maybe DisMaxQueryPaser problem)

Posted by Tobias Dittrich <di...@wave-computer.de>.
Hmmm was my mail so weird or my question so stupid ... or is 
there simply no one with an answer? Not even a hint? :(
