Posted to dev@lucene.apache.org by David Spencer <da...@tropo.com> on 2005/02/08 00:06:57 UTC

single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

scroll a bit...

Chuck Williams wrote:

> David Spencer wrote:
>   > [1]
>   > 
>   > I currently have 2 variations on the index, one w/ the default
>   > settings and another with the Similarity code Chuck attached to the
>   > bug report. Do we need other variations on the index e.g. with
>   > different weights, or during indexing are the weights less important
>   > than the log() vs. sqrt(log()) issue?
> 
> My Similarity eliminates the idf^2 by using sqrt(log()), changes the
> base of the logarithm for flattening tf and idf from e to 10 (or any
> parameter setting at runtime), changes the lengthNorm flattening from
> sqrt to log base-10 (not settable at runtime), and adds 1000 to all
> field lengths (normalizing this re. the log base-10 by changing the
> numerator from 1 to 3 = log10(1000)).
> 
> The net effects are to increase flattening of tf and idf by a constant,
> increase flattening of lengthNorm fundamentally (sqrt to log), and
> eliminate large lengthNorm effects with very small fields (further
> flattening its effect).
> 
> At least in the case of multiple fields with meaningful field-boosts,
> I've found these all improve relevance (i.e., in my app).  I found and
> made the changes 1-at-a-time based on analyzing explain()'s with result
> lists my app produces.
> 
> Re. this analysis, any sequencing of considering the different changes
> is fine with me, although once again, I don't think these are completely
> orthogonal considerations.  The combination of Similarity tuning
> decisions has impact above-and-beyond the individual effects.
> 
>   > [2]
>   > 
>   > I guess it's obvious from the above, but just to make it clear - I'll
>   > change the page to only do single field queries - but how many
>   > variations do we want to see in parallel - the current page shows 2x2
>   > results, for each combo of index and query - but I could, say, show
>   > several more queries in parallel w/ different weights...
>   >
> 
> I'd like to keep the current multi-field results as there hasn't been
> much analysis of this yet.
> 
> Re. other scenarios, I think we should look at:
>   1.  Current QueryParser and DefaultSimilarity with single field and
>       Default-OR.
>   2.  Above with Default-AND.
>   3.  My Similarity (or subset thereof) and current QueryParser with
>       Default-OR.
>   4.  Same as 3, with Default-AND.
> 
> 
> Consideration of proximity solutions (e.g., Doug's DensityQuery for
> Default-AND, and what I'm proposing for Default-OR) should be separate.
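
To make the lengthNorm change described in the quoted text concrete, here is a minimal sketch in plain Java (no Lucene dependency). The default formula is DefaultSimilarity's 1/sqrt(numTerms). The flattened formula is only a reading of the description above (add 1000 to the field length, take log base-10, numerator 3); the class and method names are made up for illustration and this is not the code attached to the bug report.

```java
// Illustrative sketch only; this is not the Similarity attached to the bug report.
final class LengthNormSketch {

    /** DefaultSimilarity's length normalization: 1 / sqrt(numTerms). */
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * Flattened variant as read from the description above: add 1000 to the
     * field length, take log base 10, and use 3 = log10(1000) as the numerator
     * so that an empty field still normalizes to 1.0.
     */
    static double flattenedLengthNorm(int numTerms) {
        return 3.0 / Math.log10(1000.0 + numTerms);
    }

    public static void main(String[] args) {
        int[] fieldLengths = {1, 4, 100, 9000};
        for (int n : fieldLengths) {
            System.out.printf("numTerms=%d default=%.4f flattened=%.4f%n",
                    n, defaultLengthNorm(n), flattenedLengthNorm(n));
        }
    }
}
```

Under these formulas a 4-term field gets 0.5 by default but just under 1.0 with the flattened version, and a 9000-term field gets about 0.0105 by default versus 0.75 flattened, which is the "eliminate large lengthNorm effects with very small fields" behavior described above.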


Sorry for the delay in getting back to this thread - hope I found the 
right place to put the reply.

I did another page (wikipedia-similarity1.jsp) which is like the earlier 
experiment in that it has 2 versions of the Wikipedia index, one with 
the default Lucene Similarity, and one with Chuck's proposal 
( http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 ).

If you want to skip the explanation just click here for an example 
results page and play around:

http://www.searchmorph.com/kat/wikipedia-similarity1.jsp?s=information+retrieval+search+engine&goal=10&tfLogBase=2.3026&idfLogBase=2.3026&phraseBoost=2.0000&slop=9999&qp=information+retrieval+search+engine
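
For readers following the log() vs. sqrt(log()) question, here is a rough sketch of the two idf shapes in plain Java. defaultIdf follows DefaultSimilarity (ln(numDocs/(docFreq+1)) + 1). sqrtIdf is an assumed form of the sqrt(log()) idea with a configurable base, not the actual code from the bug report, and all names are hypothetical. The tfLogBase and idfLogBase values of 2.3026 in the URL above are approximately ln(10), which would fit a base-10 setting, though how the page applies them is an assumption here.

```java
// Illustrative sketch only; sqrtIdf is an assumed form, not the code from the bug report.
final class IdfSketch {

    /** Logarithm of x in the given base. */
    static double logBase(double x, double base) {
        return Math.log(x) / Math.log(base);
    }

    /** DefaultSimilarity's idf: ln(numDocs / (docFreq + 1)) + 1. */
    static double defaultIdf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Assumed sqrt(log()) variant with a configurable logarithm base. */
    static double sqrtIdf(int docFreq, int numDocs, double base) {
        return Math.sqrt(logBase(numDocs / (double) (docFreq + 1), base) + 1.0);
    }

    /** idf is applied in both the query weight and the field weight, so it ends up squared. */
    static double scoreFactor(double idf) {
        return idf * idf;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        for (int docFreq : new int[] {9, 999, 99_999}) {
            double plain = scoreFactor(defaultIdf(docFreq, numDocs));
            double flat = scoreFactor(sqrtIdf(docFreq, numDocs, 10.0));
            System.out.printf("docFreq=%d idf^2(default)=%.3f idf^2(sqrt, base 10)=%.3f%n",
                    docFreq, plain, flat);
        }
    }
}
```

The point of the sqrt is that, once squared by the scoring formula, the term's contribution grows like log() rather than log()^2, so rare terms dominate less.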


The difference is that this new page only does single-field queries and 
does a lot(!) of them.  Please don't make any human factors judgments on 
the page, or submit it to Tufte as an example of how not to present 
information :)

So - there are 2 indexes and 9 (!) query parsers used for every query, 
and 18 (2 * 9) searches performed.

Queries are:

q1: MultiFieldQueryParser with OR semantics
q2: Same, with AND

q3: DistributingMultiFieldQueryParser with OR
q4: Holding place for DistributingMultiFieldQueryParser with AND (Chuck?)

q5: Code of mine that does a simple OR, so "a b" => (Query) "a b"
q6: Code of mine that does a simple AND so "a b" => +a +b

q7: Code of mine based on Doug's suggestion somewhere else in this 
thread, like q5 but tosses in a phrase, so "a b" => a b "a b"~10
q8: Like q7 but AND

q9: Separate call to QueryParser
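
As a rough illustration of the q5-q8 shapes listed above, the sketch below builds the query strings from whitespace-separated input in plain Java. It is not the code behind the page: the class and method names are hypothetical, q5 is read here as a plain OR of the terms, and treating the phrase in q8 as an optional clause is an assumption. It also assumes non-empty input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the code used by wikipedia-similarity1.jsp.
final class QueryShapeSketch {

    /** Splits the raw input on whitespace (assumes at least one term). */
    static String[] terms(String input) {
        return input.trim().split("\\s+");
    }

    /** q5-style: all terms optional, e.g. "a b" => a b */
    static String simpleOr(String input) {
        return String.join(" ", terms(input));
    }

    /** q6-style: every term required, e.g. "a b" => +a +b */
    static String simpleAnd(String input) {
        List<String> parts = new ArrayList<>();
        for (String term : terms(input)) {
            parts.add("+" + term);
        }
        return String.join(" ", parts);
    }

    /** Sloppy phrase over all terms, e.g. "a b"~10 */
    static String sloppyPhrase(String input, int slop) {
        return "\"" + String.join(" ", terms(input)) + "\"~" + slop;
    }

    /** q7-style: OR terms plus an optional sloppy phrase. */
    static String orWithPhrase(String input, int slop) {
        return simpleOr(input) + " " + sloppyPhrase(input, slop);
    }

    /** q8-style (assumed): required terms plus an optional sloppy phrase. */
    static String andWithPhrase(String input, int slop) {
        return simpleAnd(input) + " " + sloppyPhrase(input, slop);
    }

    public static void main(String[] args) {
        System.out.println(simpleOr("a b"));
        System.out.println(simpleAnd("a b"));
        System.out.println(orWithPhrase("a b", 10));
        System.out.println(andWithPhrase("a b", 10));
    }
}
```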


I'm not convinced MultiFieldQueryParser works right with one field (but 
maybe that was the point of this thread? :) )
If you search for "blahblahblah java" one would expect the AND queries 
to return zero matches, as blahblahblah does not appear in the corpus:

http://www.searchmorph.com/kat/wikipedia-similarity1.jsp?s=blahblahblah+java&goal=10&tfLogBase=2.3026&idfLogBase=2.3026&phraseBoost=1.0000&slop=10&qp=blahblahblah+java

But q2, MultiFieldQueryParser/AND returns:
    +(blahblahblah java)
instead of
    +blahblahblah +java
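
The difference between those two forms can be shown with a toy matcher in plain Java (hypothetical names, no Lucene involved). In +(a b) only the group is required and the terms inside it are optional, so a document with any one of them matches; in +a +b every term is required.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the two query shapes; not Lucene code.
final class RequiredClauseSketch {

    /** Builds the set of terms a document contains. */
    static Set<String> docOf(String... docTerms) {
        return new HashSet<>(Arrays.asList(docTerms));
    }

    /** +(a b): the group is required, but any single term inside it suffices. */
    static boolean matchesRequiredGroup(Set<String> doc, String... queryTerms) {
        for (String term : queryTerms) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** +a +b: every term must be present. */
    static boolean matchesAllRequired(Set<String> doc, String... queryTerms) {
        for (String term : queryTerms) {
            if (!doc.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> doc = docOf("java", "programming", "language");
        // The group form still matches, because "java" alone satisfies it.
        System.out.println(matchesRequiredGroup(doc, "blahblahblah", "java"));
        // The per-term form does not, because "blahblahblah" is missing.
        System.out.println(matchesAllRequired(doc, "blahblahblah", "java"));
    }
}
```

That is why the q2 form still returns hits for a term that is not in the corpus.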

The only AND code that works right when one of the terms doesn't match 
is, um, my humble code (q6/q8).

So, does this make sense and is it a useful way of trying to evaluate the 
Similarities?

I think another thread w/ a different subject has started on this topic, 
I'll try to redirect it back here.

thx,
  Dave





> 
> My $0.02,
> 
> Chuck
> 
>   > -----Original Message-----
>   > From: David Spencer [mailto:dave-lucene-dev@tropo.com]
>   > Sent: Tuesday, February 01, 2005 10:59 AM
>   > To: Lucene Developers List
>   > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
>   > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
>   > problems with Similarity.docFreq() ?
>   > 
>   > Doug Cutting wrote:
>   > 
>   > > David Spencer wrote:
>   > >
>   > >>
>   > >> +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
>   > >>
>   > >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
>   > >>
>   > >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5
>   > >
>   > >
>   > > This looks great to me!  I'd make mand=true by default, i.e., have a
>   > > method where this parameter is not specified.  Similarly, we might
>   > > default phraseBoosts[i] to boolBoosts[i]*phraseBoost, and slops to
>   > > infinity.  What we want is something that provides only the knobs
>   > > that we think most folks will need.  Ideally we wouldn't even need
>   > > to specify fieldBoosts.  Short fields like titles get a larger
>   > > lengthNorm, which effectively boosts them a lot already.
>   > 
>   > Yeah I agree w/ all of the above, offer options but have easy to use
>   > ways of calling it w/ intelligent defaults.
>   > >
>   > > But perhaps we should back off and first just evaluate single field
>   > > search with different idf, tf (and perhaps lengthNorm and sloppyFreq)
>   > > definitions.  Once we're happy with those, then we should return to
>   > > different multi-field query formulations.
>   > >
>   > > Let's start with the issue that's been raised so much: whether idf is
>   > > better defined with log() or sqrt(log()).
>   > 
>   > I can redo my page and rebuild indexes if necessary, I just need it
>   > clarified what we want to do, esp -> does the index need to be rebuilt?
>   > 
>   > [1]
>   > 
>   > I currently have 2 variations on the index, one w/ the default
>   > settings and another with the Similarity code Chuck attached to the
>   > bug report. Do we need other variations on the index e.g. with
>   > different weights, or during indexing are the weights less important
>   > than the log() vs. sqrt(log()) issue?
>   > 
>   > [2]
>   > 
>   > I guess it's obvious from the above, but just to make it clear - I'll
>   > change the page to only do single field queries - but how many
>   > variations do we want to see in parallel - the current page shows 2x2
>   > results, for each combo of index and query - but I could, say, show
>   > several more queries in parallel w/ different weights...
>   > 
>   > 
>   > >
>   > > Doug
>   > >
>   > >
>   > > ---------------------------------------------------------------------
>   > > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>   > > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>   > >
>   > 
>   > 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Posted by David Spencer <da...@tropo.com>.
Daniel Naber wrote:

> On Tuesday 08 February 2005 00:06, David Spencer wrote:
> 
> 
>>So, does this make sense and is it useful way of trying to evaluate the
>>Similarities?
> 
> 
> Is this the MultiFieldQueryParser from Lucene 1.4? 

I see WEB-INF/lib/lucene-1.5-rc1-dev.jar dated Jan 28, though I'm not 
sure if that's when I built it.

> Then it's "buggy" 
> anyway, so it probably doesn't make sense to test it. But even with the 
> current SVN version I don't see how it makes sense to use 
> MultiFieldQueryParser for searches on just one field.

Sure, if it's n/a it can be ignored, and I'll strike it from the output 
after there's more discussion here, if we want the page made cleaner.

> 
> Regards
>  Daniel
> 




Re: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Posted by Daniel Naber <da...@t-online.de>.
On Tuesday 08 February 2005 00:06, David Spencer wrote:

> So, does this make sense and is it useful way of trying to evaluate the
> Similarities?

Is this the MultiFieldQueryParser from Lucene 1.4? Then it's "buggy" 
anyway, so it probably doesn't make sense to test it. But even with the 
current SVN version I don't see how it makes sense to use 
MultiFieldQueryParser for searches on just one field.

Regards
 Daniel

-- 
http://www.danielnaber.de
