Posted to solr-user@lucene.apache.org by Ensdorf Ken <En...@zoominfo.com> on 2009/09/23 23:14:40 UTC

Mixed field types and boolean searching

Hi-

let's say you have two indexed fields, "F1" and "F2".  F1 uses the StandardAnalyzer, while F2 doesn't.  Now imagine you index a document where you have

F1="A & B"

F2="C + D"

Now imagine you run a query:

(F1:A OR F2:A) AND (F1:B OR F2:B)

In other words, both "A" and "B" must exist in at least one of F1 or F2.  This returns the document in question.  Now imagine you run another query:

(F1:A OR F2:A) AND (F1:& OR F2:&)

Since "&" is removed by the StandardAnalyzer, the parsed query looks like

(F1:A OR F2:A) AND (F2:&)

Now you don't match the document.  Is this a bug?
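
To make the behavior concrete, here is a minimal Python sketch of what I think is happening. It is a toy model, not Solr or Lucene code: `standard_like` and `whitespace_like` are hypothetical stand-ins for the StandardAnalyzer and for whatever F2 uses (assumed here to split on whitespace only), and `build_query` simply drops any clause whose term analyzes to nothing.

```python
import re

def standard_like(text):
    # Hypothetical stand-in for StandardAnalyzer: lowercase and keep only
    # alphanumeric runs, so punctuation such as "&" produces no tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def whitespace_like(text):
    # Hypothetical stand-in for F2's analyzer (an assumption): lowercase and
    # split on whitespace, keeping punctuation tokens such as "&" or "+".
    return text.lower().split()

ANALYZERS = {"F1": standard_like, "F2": whitespace_like}

def index_doc(doc):
    # Analyze each field value with that field's analyzer.
    return {field: set(ANALYZERS[field](value)) for field, value in doc.items()}

def build_query(terms, fields=("F1", "F2")):
    # One OR-group per user term; the groups are ANDed together.
    # A clause disappears when its field's analyzer emits no token.
    groups = []
    for term in terms:
        clauses = [(f, tok) for f in fields for tok in ANALYZERS[f](term)]
        if clauses:
            groups.append(clauses)
    return groups

def matches(indexed, groups):
    # Every group must be satisfied by at least one of its clauses.
    return all(any(tok in indexed[f] for f, tok in group) for group in groups)

doc = index_doc({"F1": "A & B", "F2": "C + D"})
q1 = build_query(["A", "B"])   # (F1:A OR F2:A) AND (F1:B OR F2:B)
q2 = build_query(["A", "&"])   # (F1:A OR F2:A) AND (F1:& OR F2:&)
print(q1, matches(doc, q1))
print(q2, matches(doc, q2))
```

In this model the second group of `q2` keeps only the F2 clause, and since F2 never contained "&" the document drops out, even though the F1 text did contain it before analysis.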

Thanks!
-Ken

RE: Mixed field types and boolean searching

Posted by Ensdorf Ken <En...@zoominfo.com>.

> The DisMax parser essentially creates a set of queries against
> different fields. These queries are analyzed as per each field.
> 
> I think this is what you are talking about: "The" in a movie title is
> different from "the" in the movie description. Would you expect "The
> Sound Of Music" to fetch every movie in the database? So "the" is a
> stopword in the description but is not in the title.
> 
> Also, the DisMax parser has no OR. It has +, - and "at least one of
> and more is better". The query "A B" means "A or B but both is
> better". "+a +b" means "a AND b". "+a b" means "must have 'a' but is
> better with 'b'".

Right - I'm talking about the disjunction query generated by dismax, which does contain ORs.  To take your example above, say you have a dismax handler defined that searches both movie title and description.  If the query includes "the" and you have mm=100% (so all terms are required), you will only get results whose title contains "the", regardless of the description's contents.  Is that the intent?  Because it seems wrong to me.
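
Here is a rough Python sketch of the case I mean. It is only a toy model of a dismax-style query with mm=100%, not the real parser: the stop word list, the sample movies and their descriptions, and the helper names (`analyze`, `term_groups`, `matches_all`) are all made up for illustration.

```python
# Assumed setup: "the" is a stop word for description but not for title.
STOPWORDS = {"title": set(), "description": {"the", "of", "a"}}

def analyze(field, text):
    # Toy analyzer: lowercase, split on whitespace, drop the field's stop words.
    return [t for t in text.lower().split() if t not in STOPWORDS[field]]

def index_doc(doc):
    return {field: set(analyze(field, value)) for field, value in doc.items()}

def term_groups(user_query, fields):
    # One disjunction per user term, spanning the fields that keep the term.
    groups = []
    for word in user_query.split():
        clauses = [(f, t) for f in fields for t in analyze(f, word)]
        if clauses:
            groups.append(clauses)
    return groups

def matches_all(indexed, groups):
    # Models mm=100%: every remaining disjunction must match.
    return all(any(t in indexed[f] for f, t in g) for g in groups)

# Invented sample data.
movies = {
    "m1": {"title": "The Sound of Music",
           "description": "A governess brings music to the family"},
    "m2": {"title": "Jaws",
           "description": "The shark terrorizes the beach town"},
}

def search(user_query, fields):
    groups = term_groups(user_query, fields)
    return [mid for mid, m in movies.items() if matches_all(index_doc(m), groups)]

hits_description_only = search("the shark", ("description",))
hits_both_fields = search("the shark", ("title", "description"))
print(hits_description_only)
print(hits_both_fields)
```

Searching the description alone finds the shark movie, because "the" is dropped everywhere. Adding the title field makes "the" a required title term, so the same user query now finds nothing.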

Re: Mixed field types and boolean searching

Posted by Lance Norskog <go...@gmail.com>.

The DisMax parser essentially creates a set of queries against
different fields. These queries are analyzed as per each field.

I think this is what you are talking about: "The" in a movie title is
different from "the" in the movie description. Would you expect "The
Sound Of Music" to fetch every movie in the database? So "the" is a
stopword in the description but is not in the title.

Also, the DisMax parser has no OR. It has +, - and "at least one of
and more is better". The query "A B" means "A or B but both is
better". "+a +b" means "a AND b". "+a b" means "must have 'a' but is
better with 'b'".
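
A small Python sketch of those operator semantics, for illustration only. It is not how DisMax is implemented; `parse` and `score` are hypothetical helpers, and the "score" is a plain count of matching terms rather than Lucene's real scoring.

```python
def parse(query):
    # Split a query into required (+), prohibited (-) and optional terms.
    must, must_not, should = [], [], []
    for tok in query.lower().split():
        if tok.startswith("+") and len(tok) > 1:
            must.append(tok[1:])
        elif tok.startswith("-") and len(tok) > 1:
            must_not.append(tok[1:])
        else:
            should.append(tok)
    return must, must_not, should

def score(query, text):
    # Returns 0 for "no match"; otherwise the number of matching terms,
    # so matching more optional terms ranks higher.
    words = set(text.lower().split())
    must, must_not, should = parse(query)
    if any(t in words for t in must_not):
        return 0
    if not all(t in words for t in must):
        return 0
    optional_hits = sum(t in words for t in should)
    # With no required terms, at least one optional term has to match.
    if not must and should and optional_hits == 0:
        return 0
    return len(must) + optional_hits

for q in ("a b", "+a +b", "+a b"):
    print(q, [score(q, doc) for doc in ("a", "b", "a b")])
```

With "a b" any of the three sample texts matches and "a b" scores highest; with "+a +b" only "a b" matches; with "+a b" the text "b" alone is rejected.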

On Fri, Sep 25, 2009 at 7:04 AM, Ensdorf Ken <En...@zoominfo.com> wrote:
>> No- there are various analyzers. StandardAnalyzer is geared toward
>> searching bodies of text for interesting words -  punctuation is
>> ripped out. Other analyzers are more useful for "concrete" text. You
>> may have to work at finding one that leaves punctuation in.
>>
>
> My problem is not with the StandardAnalyzer per se, but more as to how "dismax" style queries are handled by the query parser when the different fields have different sets of ignored tokens or stop words.
>
> Say you want to use the contents of a text box in your app and query a field in Solr.  The user enters "A and B", so you map this to "f1:A and f1:B".  Now, if "B" is an ignored token in the "f1" field for whatever reason, the query boils down to "f1:A".
>
> Now imagine you want to allow the user's text to match multiple fields - as in any term can match any field, but all terms must match at least 1 field.  So now you map the user's query to "(f1:A OR f2:A) AND (f1:B OR f2:B)".  But if f2 does not ignore "B", the query boils down to "(f1:A OR f2:A) AND (f2:B)".  Now documents that could come back when you were only matching against the f1 field don't come back.
>
> This seems counter-intuitive - to be consistent, I would think the query should essentially be treated as "(f1:A OR f2:A) AND (TRUE OR f2:B) " - and thus a term that is a stop word or ignored token for any of the fields would be ignored across the board.
>
> So I guess what I'm asking is if there is a reason for the existing behavior, or is it just a fact-of-life of the query parser?  Thanks!
>
> -Ken
>



-- 
Lance Norskog
goksron@gmail.com

RE: Mixed field types and boolean searching

Posted by Ensdorf Ken <En...@zoominfo.com>.

> No- there are various analyzers. StandardAnalyzer is geared toward
> searching bodies of text for interesting words -  punctuation is
> ripped out. Other analyzers are more useful for "concrete" text. You
> may have to work at finding one that leaves punctuation in.
> 

My problem is not with the StandardAnalyzer per se, but more as to how "dismax" style queries are handled by the query parser when the different fields have different sets of ignored tokens or stop words.

Say you want to use the contents of a text box in your app to query a field in Solr.  The user enters "A and B", so you map this to "f1:A AND f1:B".  Now, if "B" is an ignored token in the "f1" field for whatever reason, the query boils down to "f1:A".

Now imagine you want to allow the user's text to match multiple fields - as in any term can match any field, but all terms must match at least 1 field.  So now you map the user's query to "(f1:A OR f2:A) AND (f1:B OR f2:B)".  But if f2 does not ignore "B", the query boils down to "(f1:A OR f2:A) AND (f2:B)".  Now documents that could come back when you were only matching against the f1 field don't come back.  

This seems counter-intuitive - to be consistent, I would think the query should essentially be treated as "(f1:A OR f2:A) AND (TRUE OR f2:B)" - and thus a term that is a stop word or ignored token for any of the fields would be ignored across the board.
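
To spell out the difference, here is a minimal Python sketch comparing the two treatments. It is a toy model under stated assumptions, not the actual query parser: the `IGNORED` sets and the function names `current_behavior` and `proposed_behavior` are invented for this example.

```python
# Assumed setup: f1 ignores "b", f2 ignores nothing.
IGNORED = {"f1": {"b"}, "f2": set()}

def analyze(field, term):
    # Toy analyzer: lowercase, and emit nothing for an ignored token.
    t = term.lower()
    return [] if t in IGNORED[field] else [t]

def current_behavior(terms, fields):
    # Drop only the individual clauses whose field ignores the term.
    groups = []
    for term in terms:
        clauses = [(f, t) for f in fields for t in analyze(f, term)]
        if clauses:
            groups.append(clauses)
    return groups

def proposed_behavior(terms, fields):
    # If ANY field ignores the term, treat the whole group as TRUE
    # (that is, drop the group), as in "(TRUE OR f2:B)".
    groups = []
    for term in terms:
        per_field = [analyze(f, term) for f in fields]
        if all(per_field):
            groups.append([(f, t) for f, toks in zip(fields, per_field) for t in toks])
    return groups

def matches(doc, groups):
    return all(any(t in doc[f] for f, t in g) for g in groups)

# Indexed tokens for a document with f1="A B" and f2="C D";
# "b" never made it into f1 because f1 ignores it.
doc = {"f1": {"a"}, "f2": {"c", "d"}}

single = current_behavior(["A", "B"], ("f1",))
multi_now = current_behavior(["A", "B"], ("f1", "f2"))
multi_proposed = proposed_behavior(["A", "B"], ("f1", "f2"))
print(matches(doc, single), matches(doc, multi_now), matches(doc, multi_proposed))
```

In this model the single-field query matches, the two-field query under the existing behavior does not, and the two-field query under the proposed treatment matches again.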

So I guess what I'm asking is if there is a reason for the existing behavior, or is it just a fact-of-life of the query parser?  Thanks!

-Ken

Re: Mixed field types and boolean searching

Posted by Lance Norskog <go...@gmail.com>.

No- there are various analyzers. StandardAnalyzer is geared toward
searching bodies of text for interesting words -  punctuation is
ripped out. Other analyzers are more useful for "concrete" text. You
may have to work at finding one that leaves punctuation in.

On Wed, Sep 23, 2009 at 2:14 PM, Ensdorf Ken <En...@zoominfo.com> wrote:
> Hi-
>
> let's say you have two indexed fields, "F1" and "F2".  F1 uses the StandardAnalyzer, while F2 doesn't.  Now imagine you index a document where you have
>
> F1="A & B"
>
> F2="C + D"
>
> Now imagine you run a query:
>
> (F1:A OR F2:A) AND (F1:B OR F2:B)
>
> In other words, both "A" and "B" must exist in at least one of F1 or F2.  This returns the document in question.  Now imagine you run another query:
>
> (F1:A OR F2:A) AND (F1:& OR F2:&)
>
> Since "&" is removed by the StandardAnalyzer, the parsed query looks like
>
> (F1:A OR F2:A) AND (F2:&)
>
> Now you don't match the document.  Is this a bug?
>
> Thanks!
> -Ken
>
>



-- 
Lance Norskog
goksron@gmail.com