Posted to dev@lucene.apache.org by David Spencer <da...@tropo.com> on 2005/02/08 00:06:57 UTC
single field code ready - Re: URL to compare 2 Similarity's ready--
Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841
- MultiSearcher problems with Similarity.docFreq() ?
scroll a bit...
Chuck Williams wrote:
> David Spencer wrote:
> > [1]
> >
> > I currently have 2 variations on the index, one w/ the default settings
> > and another with the Similarity code Chuck attached to the bug report.
> > Do we need other variations on the index e.g. with different weights, or
> > during indexing are the weights less important than the log() vs.
> > sqrt(log()) issue?
>
> My Similarity eliminates the idf^2 by using sqrt(log()), changes the
> base of the logarithm for flattening tf and idf from e to 10 (or any
> parameter setting at runtime), changes the lengthNorm flattening from
> sqrt to log base-10 (not settable at runtime), and adds 1000 to all
> field lengths (normalizing this re. the log base-10 by changing the
> numerator from 1 to 3 = log10(1000)).
>
> The net effects are to increase flattening of tf and idf by a constant,
> increase flattening of lengthNorm fundamentally (sqrt to log), and
> eliminate large lengthNorm effects with very small fields (further
> flattening its effect).
>
> At least in the case of multiple fields with meaningful field-boosts,
> I've found these all improve relevance (i.e., in my app). I found and
> made the changes 1-at-a-time based on analyzing explain()'s with result
> lists my app produces.
>
> Re. this analysis, any sequencing of considering the different changes
> is fine with me, although once again, I don't think these are completely
> orthogonal considerations. The combination of Similarity tuning
> decisions has impact above-and-beyond the individual effects.
>
> > [2]
> >
> > I guess it's obvious from the above, but just to make it clear - I'll
> > change the page to only do single field queries - but how many
> > variations do we want to see in parallel - the current page shows 2x2
> > results, for each combo of index and query - but I could, say, show several
> > more queries in parallel w/ different weights...
> >
>
> I'd like to keep the current multi-field results as there hasn't been
> much analysis of this yet.
>
> Re. other scenarios, I think we should look at:
> 1. Current QueryParser and DefaultSimilarity with single field and
> Default-OR.
> 2. Above with Default-AND.
> 3. My Similarity (or subset thereof) and current QueryParser with
> Default-OR.
> 4. " with Default-AND
>
>
> Consideration of proximity solutions (e.g., Doug's DensityQuery for
> Default-AND, and what I'm proposing for Default-OR) should be separate.
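For reference, a rough sketch of the idf and lengthNorm changes Chuck describes in the quoted text above, written as plain functions with no Lucene dependency. The "default" formulas follow DefaultSimilarity (idf = ln(numDocs/(docFreq+1)) + 1, lengthNorm = 1/sqrt(numTerms)). The "flattened" versions are only one reading of the prose, not the code attached to the bug report: in particular, where the +1 sits relative to the sqrt in the idf is an assumption, and the class and method names are made up for illustration.

```java
// Sketch only: all names here are hypothetical, nothing is taken from Lucene.
class SimilaritySketch {

    // DefaultSimilarity-style idf: ln(numDocs / (docFreq + 1)) + 1.
    static double defaultIdf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Assumed reading of "sqrt(log())": same ratio, log taken in a
    // configurable base, then a square root. The clamp to 0 only guards
    // against a negative argument for very small log bases.
    static double flattenedIdf(int docFreq, int numDocs, double logBase) {
        double log = Math.log((double) numDocs / (docFreq + 1)) / Math.log(logBase);
        return Math.sqrt(Math.max(0.0, log + 1.0));
    }

    // DefaultSimilarity-style lengthNorm: 1 / sqrt(numTerms).
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // From the prose: add 1000 to the field length, use log base 10,
    // and use 3 = log10(1000) as the numerator so an empty field gets 1.0.
    static double flattenedLengthNorm(int numTerms) {
        return 3.0 / Math.log10(1000.0 + numTerms);
    }

    public static void main(String[] args) {
        int numDocs = 1000000;
        for (int docFreq : new int[] {10, 1000, 100000}) {
            double d = defaultIdf(docFreq, numDocs);
            double f = flattenedIdf(docFreq, numDocs, 10.0);
            // idf enters the score twice (query weight and field weight),
            // so the squared values are what the ranking actually sees.
            System.out.printf("docFreq=%d default idf^2=%.3f flattened idf^2=%.3f%n",
                    docFreq, d * d, f * f);
        }
        for (int len : new int[] {2, 20, 2000}) {
            System.out.printf("numTerms=%d default norm=%.3f flattened norm=%.3f%n",
                    len, defaultLengthNorm(len), flattenedLengthNorm(len));
        }
    }
}
```

Running main prints both variants side by side, which shows how much less a very short field is favored once 1000 is added to every field length.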
Sorry for the delay in getting back to this thread - hope I found the
right place to put the reply.
I did another page (wikipedia-similarity1.jsp) which is like the earlier
experiment in that it has 2 versions of the Wikipedia index, one with
the default Lucene Similarity, and one with Chuck's proposal (
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 ).
If you want to skip the explanation just click here for an example
results page and play around:
http://www.searchmorph.com/kat/wikipedia-similarity1.jsp?s=information+retrieval+search+engine&goal=10&tfLogBase=2.3026&idfLogBase=2.3026&phraseBoost=2.0000&slop=9999&qp=information+retrieval+search+engine
The difference is that this new page only does single-field queries and
does a lot(!) of them. Please don't make any human factors judgments on
the page, or submit it to Tufte as an example of how not to present
information :)
So - there are 2 indexes and 9 (!) query parsers used for every query,
and 18 (2 * 9) searches performed.
Queries are:
q1: MultiFieldQueryParser with OR semantics
q2: Same, with AND
q3: DistributingMultiFieldQueryParser with OR
q4: Holding place for DistributingMultiFieldQueryParser with AND (Chuck?)
q5: Code of mine that does a simple OR, so "a b" => a b
q6: Code of mine that does a simple AND so "a b" => +a +b
q7: Code of mine based on Doug's suggestion somewhere else in this
thread, like q5 but tosses in a phrase, so "a b" => a b "a b"~10
q8: Like q7 but AND
q9: Separate call to QueryParser
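The rewrites behind q5 through q8 can be sketched as plain string builders. This is not the code running on the page: the class and method names are invented, boosts (the phraseBoost parameter) are left out, query-syntax characters are not escaped, and the assumption for q8 is that the terms become required while the phrase stays optional.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helpers that build query strings in
// QueryParser syntax from whitespace-separated user input.
class QueryStringSketch {

    // Split the raw input on whitespace, dropping empty pieces.
    static List<String> terms(String input) {
        List<String> out = new ArrayList<>();
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // q5: simple OR, "a b" => a b
    static String simpleOr(String input) {
        return String.join(" ", terms(input));
    }

    // q6: simple AND, "a b" => +a +b
    static String simpleAnd(String input) {
        List<String> out = new ArrayList<>();
        for (String t : terms(input)) {
            out.add("+" + t);
        }
        return String.join(" ", out);
    }

    // q7: like q5 plus a sloppy phrase, "a b" => a b "a b"~10
    static String orWithPhrase(String input, int slop) {
        return simpleOr(input) + " \"" + simpleOr(input) + "\"~" + slop;
    }

    // q8 (assumed form): like q7 but with required terms,
    // "a b" => +a +b "a b"~10
    static String andWithPhrase(String input, int slop) {
        return simpleAnd(input) + " \"" + simpleOr(input) + "\"~" + slop;
    }

    public static void main(String[] args) {
        String s = "information retrieval";
        System.out.println(simpleOr(s));
        System.out.println(simpleAnd(s));
        System.out.println(orWithPhrase(s, 10));
        System.out.println(andWithPhrase(s, 10));
    }
}
```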
I'm not convinced MultiFieldQueryParser works right with one field (but
maybe that was the point of this thread? :) )
If you search for "blahblahblah java" one would expect the AND queries
to return zero matches, as blahblahblah does not appear in the corpus:
http://www.searchmorph.com/kat/wikipedia-similarity1.jsp?s=blahblahblah+java&goal=10&tfLogBase=2.3026&idfLogBase=2.3026&phraseBoost=1.0000&slop=10&qp=blahblahblah+java
But q2, MultiFieldQueryParser/AND returns:
+(blahblahblah java)
instead of
+blahblahblah +java
The only AND code that works right when one of the terms doesn't match
is, um, my humble code (q6/q8).
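To make the difference concrete, here is a toy model of the two query shapes, with a document reduced to a set of terms. It only models matching, not scoring, and uses no Lucene classes; the names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: a document is just a set of terms here.
class BooleanMatchSketch {

    // +(t1 t2 ...): a single required clause wrapping an OR group,
    // so one matching term is enough.
    static boolean matchesRequiredGroup(Set<String> docTerms, String... queryTerms) {
        for (String t : queryTerms) {
            if (docTerms.contains(t)) {
                return true;
            }
        }
        return false;
    }

    // +t1 +t2 ...: every term is its own required clause,
    // so one missing term rejects the document.
    static boolean matchesAllRequired(Set<String> docTerms, String... queryTerms) {
        for (String t : queryTerms) {
            if (!docTerms.contains(t)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "is", "a", "language"));
        // +(blahblahblah java) still matches because "java" is present.
        System.out.println(matchesRequiredGroup(doc, "blahblahblah", "java"));
        // +blahblahblah +java does not match because "blahblahblah" is absent.
        System.out.println(matchesAllRequired(doc, "blahblahblah", "java"));
    }
}
```

So with +(blahblahblah java), any document containing java is returned, which is why q2 does not come back empty.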
So, does this make sense, and is it a useful way of trying to evaluate the
Similarities?
I think another thread w/ a different subject has started on this topic,
I'll try to redirect it back here.
thx,
Dave
>
> My $0.02,
>
> Chuck
>
> > -----Original Message-----
> > From: David Spencer [mailto:dave-lucene-dev@tropo.com]
> > Sent: Tuesday, February 01, 2005 10:59 AM
> > To: Lucene Developers List
> > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
> > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
> > problems with Similarity.docFreq() ?
> >
> > Doug Cutting wrote:
> >
> > > David Spencer wrote:
> > >
> > >>
> > >> +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
> > >>
> > >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
> > >>
> > >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5)
> > >> f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5
> > >
> > >
> > > This looks great to me! I'd make mand=true by default, i.e., have a
> > > method where this parameter is not specified. Similarly, we might
> > > default phraseBoosts[i] to boolBoosts[i]*phraseBoost, and slops to
> > > infinity. What we want is something that provides only the knobs that
> > > we think most folks will need. Ideally we wouldn't even need to specify
> > > fieldBoosts. Short fields like titles get a larger lengthNorm, which
> > > effectively boosts them a lot already.
> >
> > Yeah I agree w/ all of the above, offer options but have easy to use
> > ways of calling it w/ intelligent defaults.
> > >
> > > But perhaps we should back off and first just evaluate single field
> > > search with different idf, tf (and perhaps lengthNorm and sloppyFreq)
> > > definitions. Once we're happy with those, then we should return to
> > > different multi-field query formulations.
> > >
> > > Let's start with the issue that's been raised so much: whether idf is
> > > better defined with log() or sqrt(log()).
> >
> > I can redo my page and rebuild indexes if necessary, I just need it
> > clarified what we want to do, esp -> does the index need to be rebuilt?
> >
> > [1]
> >
> > I currently have 2 variations on the index, one w/ the default settings
> > and another with the Similarity code Chuck attached to the bug report.
> > Do we need other variations on the index e.g. with different weights, or
> > during indexing are the weights less important than the log() vs.
> > sqrt(log()) issue?
> >
> > [2]
> >
> > I guess it's obvious from the above, but just to make it clear - I'll
> > change the page to only do single field queries - but how many
> > variations do we want to see in parallel - the current page shows 2x2
> > results, for each combo of index and query - but I could, say, show several
> > more queries in parallel w/ different weights...
> >
> >
> > >
> > > Doug
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> > >
> >
> >
> >
>
>
>
Re: single field code ready - Re: URL to compare 2 Similarity's ready--
Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841
- MultiSearcher problems with Similarity.docFreq() ?
Posted by David Spencer <da...@tropo.com>.
Daniel Naber wrote:
> On Tuesday 08 February 2005 00:06, David Spencer wrote:
>
>
>>So, does this make sense, and is it a useful way of trying to evaluate the
>>Similarities?
>
>
> Is this the MultiFieldQueryParser from Lucene 1.4?
I see WEB-INF/lib/lucene-1.5-rc1-dev.jar dated Jan 28, though I'm not
sure if that's when I built it.
> Then it's "buggy"
> anyway, so it probably doesn't make sense to test it. But even with the
> current SVN version I don't see how it makes sense to use
> MultiFieldQueryParser for searches on just one field.
Sure, if it's n/a it can be ignored, and I'll strike it from the output after
there's more discussion here, if we want the page made cleaner.
>
> Regards
> Daniel
>
Re: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Posted by Daniel Naber <da...@t-online.de>.
On Tuesday 08 February 2005 00:06, David Spencer wrote:
> So, does this make sense, and is it a useful way of trying to evaluate the
> Similarities?
Is this the MultiFieldQueryParser from Lucene 1.4? Then it's "buggy"
anyway, so it probably doesn't make sense to test it. But even with the
current SVN version I don't see how it makes sense to use
MultiFieldQueryParser for searches on just one field.
Regards
Daniel
--
http://www.danielnaber.de