Posted to java-user@lucene.apache.org by Bill Janssen <ja...@parc.com> on 2004/09/09 02:01:06 UTC

Re: MultiFieldQueryParser seems broken... Fix attached.

René,

Thanks for your note.

I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appear.  That is,

(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Instead, what they'd get using the current (broken) strategy of outer
combination used by the current MultiFieldQueryParser, would be

(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)

Note that this would match even if only "lucene" occurred in the
document, as long as it occurred both in the title field and in the
author field.  Or, for that matter, it would also match "Cutting on
Cutting", by Doug Cutting :-).
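
A minimal sketch of the difference between the two forms, assuming a toy model in which a document is just a map from field name to the set of terms in that field. This is not Lucene code, and the class and method names (ExpansionDemo, perTerm, perField) are made up for illustration:

```java
import java.util.Map;
import java.util.Set;

final class ExpansionDemo {

    // True if the named field of the document contains the term.
    static boolean has(Map<String, Set<String>> doc, String field, String term) {
        return doc.getOrDefault(field, Set.of()).contains(term);
    }

    // Per-term form: (title:a OR author:a) AND (title:b OR author:b).
    // Every term must occur in at least one field.
    static boolean perTerm(Map<String, Set<String>> doc, String[] fields, String[] terms) {
        for (String term : terms) {
            boolean found = false;
            for (String field : fields) {
                found |= has(doc, field, term);
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // Per-field ("outer") form: (title:a OR title:b) AND (author:a OR author:b).
    // Every field must contain at least one of the terms.
    static boolean perField(Map<String, Set<String>> doc, String[] fields, String[] terms) {
        for (String field : fields) {
            boolean found = false;
            for (String term : terms) {
                found |= has(doc, field, term);
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] fields = {"title", "author"};
        String[] terms = {"cutting", "lucene"};

        // "Cutting on Cutting", by Doug Cutting: no "lucene" anywhere.
        Map<String, Set<String>> cuttingOnCutting = Map.of(
                "title", Set.of("cutting", "on"),
                "author", Set.of("doug", "cutting"));

        System.out.println(perTerm(cuttingOnCutting, fields, terms));  // false
        System.out.println(perField(cuttingOnCutting, fields, terms)); // true
    }
}
```

In this toy model the per-field form also rejects a document whose title contains both words but whose author field contains neither, which is probably not what a user expects either.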

> http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=1798116

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: MultiFieldQueryParser seems broken... Fix attached.

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> On Thursday 09 September 2004 18:52, Doug Cutting wrote:
> 
> 
>>I have not been
>>able to construct a two-word query that returns a page without both
>>words in either the content, the title, the url or in a single anchor.
>>Can you?
> 
> 
> Like this one?
> 
> konvens leitseite 
> 
> Leitseite is only in the title of the first match (www.gldv.org), konvens 
> is only in the body.

Good job finding that!  I guess I should fix Nutch's BasicQueryFilter.

Thanks,

Doug



Re: MultiFieldQueryParser seems broken... Fix attached.

Posted by Daniel Naber <da...@t-online.de>.
On Thursday 09 September 2004 18:52, Doug Cutting wrote:

> I have not been
> able to construct a two-word query that returns a page without both
> words in either the content, the title, the url or in a single anchor.
> Can you?

Like this one?

konvens leitseite 

Leitseite is only in the title of the first match (www.gldv.org), konvens 
is only in the body.

-- 
http://www.danielnaber.de



Re: MultiFieldQueryParser seems broken... Fix attached.

Posted by Doug Cutting <cu...@apache.org>.
Bill Janssen wrote:
> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appear.  That is,
> 
> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Your proposal is certainly an improvement.

It's interesting to note that in Nutch I implemented something 
different.  There, a search for "cutting lucene" expands to something like:

  (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
  (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
  (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)

So a page with "cutting" in the body and "lucene" in anchor text won't 
match: the body, anchor or url must contain all query terms.  A single 
authority (content, url or anchor) must vouch for all attributes.

Note that Nutch also boosts matches where the terms are close together.
Using "~2147483647" permits them to be anywhere in the document, but
boosts more when they're closer and in order.  (The "~4" in anchor
matches is to prohibit matches across different anchors.  Each anchor is
separated by a Token.positionIncrement() of 4.)

But perhaps this is not a feature.  Perhaps Nutch should instead expand 
this to:

  +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
  +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
  url:"cutting lucene"~2147483647^4.0
  anchor:"cutting lucene"~4^2.0
  content:"cutting lucene"~2147483647

That would, e.g., permit a match with only "lucene" in an anchor and 
"cutting" in the content, which the earlier formulation would not.

Can anyone tell whether Google has this requirement?  I have not been 
able to construct a two-word query that returns a page without both 
words in either the content, the title, the url or in a single anchor. 
Can you?

If you're interested, the Nutch query expansion code in question is:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup

To play with it you can download Nutch and use the command:

   bin/nutch net.nutch.searcher.Query
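
Without a Nutch checkout at hand, the two expansions quoted above can be reproduced as plain strings. The following is only a sketch under stated assumptions: it is not the BasicQueryFilter code, the class and method names are hypothetical, and the boosts and slops are simply copied from the example queries in this message.

```java
import java.util.ArrayList;
import java.util.List;

final class ExpansionStrings {

    // One searched field: its name, its boost, and the slop of its phrase clause.
    static final class FieldSpec {
        final String name;
        final float boost;
        final int slop;

        FieldSpec(String name, float boost, int slop) {
            this.name = name;
            this.boost = boost;
            this.slop = slop;
        }

        // Suffix such as "^4.0"; omitted when the boost is the default 1.0.
        String boostSuffix() {
            return boost == 1.0f ? "" : "^" + boost;
        }
    }

    // Field settings as they appear in the example queries (an assumption, not Nutch config).
    static List<FieldSpec> fieldsFromMessage() {
        return List.of(
                new FieldSpec("url", 4.0f, Integer.MAX_VALUE),
                new FieldSpec("anchor", 2.0f, 4),
                new FieldSpec("content", 1.0f, Integer.MAX_VALUE));
    }

    // First formulation: one clause per field, each requiring all terms in that field.
    static List<String> allTermsInOneField(List<FieldSpec> fields, List<String> terms) {
        String phrase = String.join(" ", terms);
        List<String> clauses = new ArrayList<>();
        for (FieldSpec f : fields) {
            StringBuilder sb = new StringBuilder("(");
            for (String t : terms) {
                sb.append('+').append(f.name).append(':').append(t)
                  .append(f.boostSuffix()).append(' ');
            }
            sb.append('+').append(f.name).append(":\"").append(phrase).append("\"~")
              .append(f.slop).append(f.boostSuffix()).append(')');
            clauses.add(sb.toString());
        }
        return clauses;
    }

    // Second formulation: one required clause per term (any field),
    // followed by optional per-field phrase clauses that only affect ranking.
    static List<String> eachTermInAnyField(List<FieldSpec> fields, List<String> terms) {
        String phrase = String.join(" ", terms);
        List<String> clauses = new ArrayList<>();
        for (String t : terms) {
            List<String> alternatives = new ArrayList<>();
            for (FieldSpec f : fields) {
                alternatives.add(f.name + ":" + t + f.boostSuffix());
            }
            clauses.add("+(" + String.join(" ", alternatives) + ")");
        }
        for (FieldSpec f : fields) {
            clauses.add(f.name + ":\"" + phrase + "\"~" + f.slop + f.boostSuffix());
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("cutting", "lucene");
        allTermsInOneField(fieldsFromMessage(), terms).forEach(System.out::println);
        System.out.println();
        eachTermInAnyField(fieldsFromMessage(), terms).forEach(System.out::println);
    }
}
```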

>>http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=1798116
> 
> 
> Yes, the approach there is similar.  I attempted to complete the
> solution and provide a working replacement for MultiFieldQueryParser.

But, inspired by that message, couldn't MultiFieldQueryParser just be a 
subclass of QueryParser that overrides getFieldQuery()?
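
To illustrate the shape of that idea without depending on Lucene, here is a toy sketch. ToyQueryParser and ToyMultiFieldQueryParser are hypothetical stand-ins, and the two-argument getFieldQuery(field, term) hook below is an assumption made for the example; it does not reproduce the real QueryParser signature. The base class turns each term into a clause through the hook, and the subclass overrides only the hook so that each term is spread across all default fields.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a single-field parser: splits on whitespace and ANDs the terms.
class ToyQueryParser {
    private final String defaultField;

    ToyQueryParser(String defaultField) {
        this.defaultField = defaultField;
    }

    // Hook that builds the clause for one term. Subclasses may override it.
    protected String getFieldQuery(String field, String term) {
        return field + ":" + term;
    }

    String parse(String query) {
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        List<String> clauses = new ArrayList<>();
        for (String term : trimmed.split("\\s+")) {
            clauses.add(getFieldQuery(defaultField, term));
        }
        return String.join(" AND ", clauses);
    }
}

// Multi-field variant: only the per-term hook changes.
class ToyMultiFieldQueryParser extends ToyQueryParser {
    private final String[] fields;

    ToyMultiFieldQueryParser(String... fields) {
        super(fields[0]);
        this.fields = fields;
    }

    // Ignores the single default field and ORs the term over every field.
    // (A real parser would also have to respect explicit "field:term" syntax.)
    @Override
    protected String getFieldQuery(String field, String term) {
        List<String> alternatives = new ArrayList<>();
        for (String f : fields) {
            alternatives.add(f + ":" + term);
        }
        return "(" + String.join(" OR ", alternatives) + ")";
    }

    public static void main(String[] args) {
        System.out.println(new ToyMultiFieldQueryParser("title", "author").parse("cutting lucene"));
    }
}
```

Because the expansion happens per term inside the hook, the AND/OR structure built by the base class is left untouched, which is what yields the per-term form Bill describes.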

Cheers,

Doug



Re: MultiFieldQueryParser seems broken... Fix attached.

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.
Hi Bill,

I think many people are waiting for this patch to MultiFieldQueryParser.
It would be nice if it were included in the next release candidate.

    All the best,

   Sergiu

Bill Janssen wrote:

>René,
>
>Thanks for your note.
>
>I'd think that if a user specified a query "cutting lucene", with an
>implicit AND and the default fields "title" and "author", they'd
>expect to see a match in which both "cutting" and "lucene" appear.  That is,
>
>(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
>
>Instead, what they'd get using the current (broken) strategy of outer
>combination used by the current MultiFieldQueryParser, would be
>
>(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
>
>Note that this would match even if only "lucene" occurred in the
>document, as long as it occurred both in the title field and in the
>author field.  Or, for that matter, it would also match "Cutting on
>Cutting", by Doug Cutting :-).
>
>  
>
>>http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=1798116
>>    
>>
>
>Yes, the approach there is similar.  I attempted to complete the
>solution and provide a working replacement for MultiFieldQueryParser.
>
>Bill
>




Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.
René Hackl wrote:

>>Is it a problem if users search for "coffee OR tea" as a search
>>string when MultiFieldQueryParser is modified as Bill suggested
>>and the default operator is set to AND?
>>    
>>
>
>No. There's not a problem with the proposed correction to MFQP. MFQP should
>work the way Bill suggested.
>
>My babbling about coffee or tea was more aimed at Bill's referring to "darn
>users started demanding" <nifty feature>. So this is a totally different
>matter. In my experience, many users fall into everyday language traps, like
>in: "What do you want to drink, coffee or tea?" The answer normally isn't
>'yes' to both, is it?  
>
>  
>
This problem may be solved if users know the meaning of the
following signs:
- + "" * ~
That would improve the results more than our parsing does ...

>I have an app where in some cases I make subqueries for an initial
>user-stated query. The aim is to come up with pointers to partial matching
>docs. The background is, one ill-advised NOT can ruin a query. But this has
>nothing to do with MFQP. Just random thoughts about making users happy even
>when they are new to formulating queries :-)
>
>Cheers,
>René
>  
>









Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

Posted by René Hackl <re...@gmx.de>.
> Is it a problem if users search for "coffee OR tea" as a search
> string when MultiFieldQueryParser is modified as Bill suggested
> and the default operator is set to AND?

No. There's not a problem with the proposed correction to MFQP. MFQP should
work the way Bill suggested.

My babbling about coffee or tea was more aimed at Bill's referring to "darn
users started demanding" <nifty feature>. So this is a totally different
matter. In my experience, many users fall into everyday language traps, like
in: "What do you want to drink, coffee or tea?" The answer normally isn't
'yes' to both, is it?  

I have an app where in some cases I make subqueries for an initial
user-stated query. The aim is to come up with pointers to partial matching
docs. The background is, one ill-advised NOT can ruin a query. But this has
nothing to do with MFQP. Just random thoughts about making users happy even
when they are new to formulating queries :-)

Cheers,
René





Re: MultiFieldQueryParser seems broken... Fix attached.

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.
René Hackl wrote:

>Bill,
>
>Thank you for clarifying that issue. I missed the...
>
>  
>
>>(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
>>    
>>
>   ...
>  
>
>>(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
>>
>>Note that this would match even if only "lucene" occurred in the
>>    
>>
>
>... "only lucene"/"only cutting" match. 
>
>  
>
>>I'd think that if a user specified a query "cutting lucene", with an
>>implicit AND and the default fields "title" and "author", they'd
>>expect to see a match in which both "cutting" and "lucene" appear.
>>    
>>
>
>Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
>tea" would provide matches with either term, but not both. But this is
>already "user-attune your application" territory. Your proposal makes
>perfect sense, of course.
>
>René
>
>  
>
Is it a problem if users search for "coffee OR tea" as a search
string when MultiFieldQueryParser is modified as Bill suggested
and the default operator is set to AND?

I don't think so. I think that, for a query like "cutting OR lucene", the
resulting Query should be:

(title:cutting OR author:cutting) OR (title:lucene OR author:lucene)

and I think that the results will be correct.
Am I wrong?

I don't know exactly what will happen with more complex queries that use
grouping, exact matches and the NOT operator, like:

  (alcohol NOT tea) OR ("black tea" AND brandy)

What will happen if you send this to a MultiFieldQueryParser that searches
an index with the fields "drink" and "juices"?

Maybe this kind of search construction should be part of the JUnit tests,
if it is not already there.
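
As a rough idea of what such a check could assert, here is a sketch of one plausible reading of that query under per-term expansion, where every term or phrase is looked up in both fields before the boolean operators are applied. This is an assumption about the intended behaviour, not output from MultiFieldQueryParser; the class name, the toy document model (field name mapped to raw text) and the whitespace tokenisation are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ComplexQueryCheck {

    static final String[] FIELDS = {"drink", "juices"};

    // True if any searched field contains the words as a consecutive phrase.
    // A single word is treated as a one-word phrase.
    static boolean anyField(Map<String, String> doc, String phrase) {
        List<String> wanted = Arrays.asList(phrase.split(" "));
        for (String field : FIELDS) {
            String text = doc.getOrDefault(field, "").toLowerCase();
            List<String> tokens = Arrays.asList(text.split("\\s+"));
            if (Collections.indexOfSubList(tokens, wanted) >= 0) {
                return true;
            }
        }
        return false;
    }

    // (alcohol NOT tea) OR ("black tea" AND brandy), with each operand
    // expanded over both fields.
    static boolean matches(Map<String, String> doc) {
        return (anyField(doc, "alcohol") && !anyField(doc, "tea"))
                || (anyField(doc, "black tea") && anyField(doc, "brandy"));
    }

    public static void main(String[] args) {
        // "tea" in the other field still triggers the NOT under this reading.
        Map<String, String> doc = Map.of("drink", "alcohol", "juices", "iced tea");
        System.out.println(matches(doc)); // false
    }
}
```

Under this reading the NOT applies across fields, so a document with "alcohol" in one field and "tea" in the other is excluded. Whether that is the desired behaviour is exactly the kind of question a test case would pin down.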

 
 Thanks,

 Sergiu 
  

 





Re: MultiFieldQueryParser seems broken... Fix attached.

Posted by René Hackl <re...@gmx.de>.
Bill,

Thank you for clarifying that issue. I missed the...

> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
   ...
> (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> 
> Note that this would match even if only "lucene" occurred in the

... "only lucene"/"only cutting" match. 

> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appear.

Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR
tea" would provide matches with either term, but not both. But this is
already "user-attune your application" territory. Your proposal makes
perfect sense, of course.

René



