Posted to java-user@lucene.apache.org by Ivan Provalov <ip...@yahoo.com> on 2010/02/16 15:46:34 UTC

BM25 Scoring Patch

I applied the Lucene patch mentioned in https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on the TREC-3 collection using topics 151-200.  I am now getting worse results compared to Lucene's DefaultSimilarity.  I suspect I am not using it correctly.  I have single-field documents.  This is the process I use:

1. During the indexing, I am setting the similarity to BM25 as such:

IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_CURRENT), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
writer.setSimilarity(new BM25Similarity());
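
The average document length passed to BM25Parameters in step 3 has to be measured somewhere; one way (an assumption on my part, not something the patch provides) is to tally token counts while feeding documents to the writer.  A sketch with a naive whitespace tokenizer, which will not match StandardAnalyzer's token count exactly:

```java
import java.util.List;

public class AvgDocLength {
    // Rough average document length in tokens; a real measurement should
    // count the same tokens the analyzer produces at index time.
    static float averageLength(List<String> docs) {
        long tokens = 0;
        for (String doc : docs) {
            String t = doc.trim();
            tokens += t.isEmpty() ? 0 : t.split("\\s+").length;
        }
        return docs.isEmpty() ? 0f : (float) tokens / docs.size();
    }

    public static void main(String[] args) {
        // Two documents of 4 and 2 tokens: average is 3.0
        List<String> docs = List.of("a b c d", "a b");
        System.out.println(averageLength(docs));
    }
}
```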

2. During the Precision/Recall measurements, I am using a SimpleBM25QQParser extension I added to the benchmark:

QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");


3. Here is the parser code (I set an avg doc length here):

public Query parse(QualityQuery qq) throws ParseException {
    BM25Parameters.setAverageLength(indexField, 798.30f); // avg doc length
    BM25Parameters.setB(0.5f);  // tried the default values as well
    BM25Parameters.setK1(2f);
    return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
            new StandardAnalyzer(Version.LUCENE_CURRENT));
}
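
For reference, the textbook BM25 term weight these parameters feed into looks roughly like the sketch below (this is the standard Robertson formula, not the patch's actual code, and the constants are just the ones from the parser above).  It shows what k1, b, and the average length control:

```java
public class Bm25Sketch {
    // Classic BM25 term weight: idf * tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl)).
    // b scales document-length normalization; k1 controls tf saturation.
    static double bm25(double tf, double docLen, double avgDocLen,
                       double docCount, double docFreq, double k1, double b) {
        double idf = Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
        double norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));
        return idf * (tf * (k1 + 1)) / norm;
    }

    public static void main(String[] args) {
        // With b = 0, length normalization is off: a long and a short document
        // with the same tf score identically.
        double longDoc  = bm25(3, 3543, 798.30, 1_000_000, 1000, 2, 0);
        double shortDoc = bm25(3, 353, 798.30, 1_000_000, 1000, 2, 0);
        System.out.println(longDoc == shortDoc);

        // With b = 0.5 (the value tried above), the longer document is penalized.
        double longB  = bm25(3, 3543, 798.30, 1_000_000, 1000, 2, 0.5);
        double shortB = bm25(3, 353, 798.30, 1_000_000, 1000, 2, 0.5);
        System.out.println(longB < shortB);
    }
}
```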

4. The searcher is using BM25 similarity:

Searcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(sim);

Am I missing some steps?  Does anyone have experience with this code?

Thanks,

Ivan


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
No, I mean that, gathering your previous emails, you have supplied these MAP
improvements:

SweetSpot: 15%
lnb.ltc: 24%
bm25: 21%

These are close enough that, given the bias from a pooled collection (
http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf), I
wouldn't want to say for sure that any one is better than the others for this
collection, but it's probably safe to say they are all an improvement for
this collection... this is assuming your SweetSpot/lnb.btc calculations were
correct?
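
One way to put numbers behind "close enough" is a paired comparison over per-topic AP scores rather than the MAP averages alone.  A minimal sign-test-style tally (the per-topic AP arrays below are made up purely for illustration, not from the thread):

```java
public class SignTestSketch {
    // Counts topics where system A beats system B and vice versa; with many
    // ties and a small margin either way, a MAP difference is weak evidence.
    static int[] wins(double[] apA, double[] apB) {
        int aWins = 0, bWins = 0;
        for (int i = 0; i < apA.length; i++) {
            if (apA[i] > apB[i]) aWins++;
            else if (apB[i] > apA[i]) bWins++;
        }
        return new int[] {aWins, bWins};
    }

    public static void main(String[] args) {
        double[] sysA = {0.30, 0.10, 0.25, 0.40};  // hypothetical per-topic AP
        double[] sysB = {0.28, 0.12, 0.25, 0.45};
        int[] w = wins(sysA, sysB);
        System.out.println(w[0] + " vs " + w[1]);  // one win each way plus a tie
    }
}
```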

On Tue, Feb 16, 2010 at 2:16 PM, Ivan Provalov <ip...@yahoo.com> wrote:

> By the end of the week, I will publish the results once we run the
> experiments on a full collection.  Are you talking about the bias caused by
> using a sub-collection?
>
> Thanks,
>
> Ivan
>
> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>
> > From: Robert Muir <rc...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 2:11 PM
> > Ivan, ok. it would be cool if you can
> > list the map and bpref for the
> > different approaches you try (default lucene, lnb.ltc,
> > bm25), with or
> > without stemming.
> >
> > as you reported previously you got a 24% improvement with
> > lnb.btc (right?) I
> > am guessing that we won't be able to draw many conclusions
> > at all due to
> > bias.
> >
> > On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov <ip...@yahoo.com>
> > wrote:
> >
> > > Robert, Joaquin,
> > >
> > > Sorry, I made an error reporting the results.
> > The preliminary improvement
> > > is around 21% (it's a reduced collection).  I
> > will have to run another test
> > > to get the final numbers on the complete collection.
> > >
> > > We are planning to also apply the stemming.
> > Right now we are trying to
> > > isolate each improvement experiment.
> > >
> > > Thanks,
> > >
> > > Ivan
> > >
> > >
> > >
> > > --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com>
> > wrote:
> > >
> > > > From: Robert Muir <rc...@gmail.com>
> > > > Subject: Re: BM25 Scoring Patch
> > > > To: java-user@lucene.apache.org
> > > > Date: Tuesday, February 16, 2010, 1:14 PM
> > > > Ivan just a little more food for
> > > > thought to help you with this:
> > > >
> > > > I'm glad you got improved results, yet I stand by
> > my
> > > > original statement of
> > > > 'be careful' interpreting too much from one
> > collection.
> > > >
> > > > eg. had you chosen TREC-4 instead of TREC-3, you
> > would see
> > > > different
> > > > results, as vector-space with non-cosine doc
> > length norm
> > > > (LUCENE-2187)
> > > > performed better than BM25 there:
> > > > http://trec.nist.gov/pubs/trec4/overview.ps.gz
> > > >
> > > > in truth its hard to 'reuse' a pooled test
> > collection to
> > > > compare methods
> > > > that were not part of the pool:
> > > > http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
> > > >
> > > > This might help explain why you see such a
> > difference in
> > > > MAP score!
> > > >
> > > > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov
> > <ip...@yahoo.com>
> > > > wrote:
> > > >
> > > > > Joaquin, Robert,
> > > > >
> > > > > I followed Joaquin's recommendation and
> > removed the
> > > > call to set similarity
> > > > > to BM25 explicitly (indexer,
> > searcher).  The
> > > > results showed 55% improvement
> > > > > for the MAP score (0.141->0.219) over
> > default
> > > > similarity.
> > > > >
> > > > > Joaquin, how would setting the similarity to
> > BM25
> > > > explicitly make the score
> > > > > worse?
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Ivan
> > > > >
> > > > >
> > > > >
> > > > > --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > From: Robert Muir <rc...@gmail.com>
> > > > > > Subject: Re: BM25 Scoring Patch
> > > > > > To: java-user@lucene.apache.org
> > > > > > Date: Tuesday, February 16, 2010, 11:36
> > AM
> > > > > > yes Ivan, if possible please report
> > > > > > back any findings you can on the
> > > > > > experiments you are doing!
> > > > > >
> > > > > > On Tue, Feb 16, 2010 at 11:22 AM,
> > Joaquin Perez
> > > > Iglesias
> > > > > > <
> > > > > > joaquin.perez@lsi.uned.es>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ivan,
> > > > > > >
> > > > > > > You shouldn't set the
> > BM25Similarity for
> > > > indexing or
> > > > > > searching.
> > > > > > > Please try removing the lines:
> > > > > >
> > >   writer.setSimilarity(new
> > > > > > BM25Similarity());
> > > > > >
> > > >
> > >   searcher.setSimilarity(sim);
> > > > > > >
> > > > > > > Please let us/me know if you
> > improve your
> > > > results with
> > > > > > these changes.
> > > > > > >
> > > > > > >
> > > > > > > Robert Muir escribió:
> > > > > > >
> > > > > > >  Hi Ivan, I've seen many
> > cases where
> > > > BM25
> > > > > > performs worse than Lucene's
> > > > > > >> default Similarity. Perhaps
> > this is just
> > > > another
> > > > > > one?
> > > > > > >>
> > > > > > >> Again while I have not worked
> > with this
> > > > particular
> > > > > > collection, I looked at
> > > > > > >> the statistics and noted that
> > its
> > > > composed of
> > > > > > several 'sub-collections':
> > > > > > >> for
> > > > > > >> example the PAT documents on
> > disk 3 have
> > > > an
> > > > > > average doc length of 3543,
> > > > > > >> but
> > > > > > >> the AP documents on disk 1
> > have an avg
> > > > doc length
> > > > > > of 353.
> > > > > > >>
> > > > > > >> I have found on other
> > collections that
> > > > any
> > > > > > advantages of BM25's document
> > > > > > >> length normalization fall
> > apart when
> > > > 'average
> > > > > > document length' doesn't
> > > > > > >> make
> > > > > > >> a whole lot of sense (cases
> > like this).
> > > > > > >>
> > > > > > >> For this same reason, I've
> > only found a
> > > > few
> > > > > > collections where BM25's doc
> > > > > > >> length normalization is
> > really
> > > > significantly
> > > > > > better than Lucene's.
> > > > > > >>
> > > > > > >> In my opinion, the results on
> > a
> > > > particular test
> > > > > > collection or 2 have
> > > > > > >> perhaps
> > > > > > >> been taken too far and created
> > a myth
> > > > that BM25 is
> > > > > > always superior to
> > > > > > >> Lucene's scoring... this is
> > not true!
> > > > > > >>
> > > > > > >> On Tue, Feb 16, 2010 at 9:46
> > AM, Ivan
> > > > Provalov
> > > > > > <ip...@yahoo.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>  I applied the Lucene
> > patch
> > > > mentioned in
> > > > > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > > > > ran the MAP
> > > > > > >>> numbers
> > > > > > >>> on TREC-3 collection using
> > topics
> > > > > > 151-200.  I am not getting worse
> > > > > > >>> results
> > > > > > >>> comparing to Lucene
> > > > DefaultSimilarity.  I
> > > > > > suspect, I am not using it
> > > > > > >>> correctly.  I have
> > single
> > > > field
> > > > > > documents.  This is the process I
> > use:
> > > > > > >>>
> > > > > > >>> 1. During the indexing, I
> > am setting
> > > > the
> > > > > > similarity to BM25 as such:
> > > > > > >>>
> > > > > > >>> IndexWriter writer = new
> > > > IndexWriter(dir, new
> > > > > > StandardAnalyzer(
> > > > > > >>>
> > > > > >    Version.LUCENE_CURRENT),
> > true,
> > > > > > >>>
> > > > > >
> > > > IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > > >>> writer.setSimilarity(new
> > > > BM25Similarity());
> > > > > > >>>
> > > > > > >>> 2. During the
> > Precision/Recall
> > > > measurements, I
> > > > > > am using a
> > > > > > >>> SimpleBM25QQParser
> > extension I added
> > > > to the
> > > > > > benchmark:
> > > > > > >>>
> > > > > > >>> QualityQueryParser
> > qqParser = new
> > > > > > SimpleBM25QQParser("title", "TEXT");
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> 3. Here is the parser code
> > (I set an
> > > > avg doc
> > > > > > length here):
> > > > > > >>>
> > > > > > >>> public Query
> > parse(QualityQuery qq)
> > > > throws
> > > > > > ParseException {
> > > > > >
> > > >
> > >>>   BM25Parameters.setAverageLength(indexField,
> > > > > > 798.30f);//avg doc length
> > > > > >
> > > >
> > >>>   BM25Parameters.setB(0.5f);//tried
> > > > > > default values
> > > > > >
> > > >
> > >>>   BM25Parameters.setK1(2f);
> > > > > > >>>   return
> > query = new
> > > > > > BM25BooleanQuery(qq.getValue(qqName),
> > > > indexField,
> > > > > > >>> new
> > > > > > >>>
> > > > StandardAnalyzer(Version.LUCENE_CURRENT));
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> 4. The searcher is using
> > BM25
> > > > similarity:
> > > > > > >>>
> > > > > > >>> Searcher searcher = new
> > > > IndexSearcher(dir,
> > > > > > true);
> > > > > > >>>
> > searcher.setSimilarity(sim);
> > > > > > >>>
> > > > > > >>> Am I missing some
> > steps?  Does
> > > > anyone
> > > > > > have experience with this code?
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>>
> > > > > > >>> Ivan
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > > >>> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > > > > > >>> For additional commands,
> > e-mail:
> > > java-user-help@lucene.apache.org
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > > --
> > > > > > >
> > > > > >
> > > >
> > -----------------------------------------------------------
> > > > > > > Joaquín Pérez Iglesias
> > > > > > > Dpto. Lenguajes y Sistemas
> > Informáticos
> > > > > > > E.T.S.I. Informática (UNED)
> > > > > > > Ciudad Universitaria
> > > > > > > C/ Juan del Rosal nº 16
> > > > > > > 28040 Madrid - Spain
> > > > > > > Phone. +34 91 398 89 19
> > > > > > > Fax    +34 91 398 65 35
> > > > > > > Office  2.11
> > > > > > > Email: joaquin.perez@lsi.uned.es
> > > > > > > web:   http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/>
> <http://nlp.uned.es/%7Ejperezi/><
> > > http://nlp.uned.es/%7Ejperezi/> <
> > > > > http://nlp.uned.es/%7Ejperezi/>
> > > > > > >
> > > > > >
> > > >
> > -----------------------------------------------------------
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail:
> > java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcmuir@gmail.com
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> > >
> > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Ivan Provalov <ip...@yahoo.com>.
By the end of the week, I will publish the results once we run the experiments on a full collection.  Are you talking about the bias caused by using a sub-collection?

Thanks,

Ivan

--- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:

> From: Robert Muir <rc...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Tuesday, February 16, 2010, 2:11 PM
> Ivan, ok. it would be cool if you can
> list the map and bpref for the
> different approaches you try (default lucene, lnb.ltc,
> bm25), with or
> without stemming.
> 
> as you reported previously you got a 24% improvement with
> lnb.btc (right?) I
> am guessing that we won't be able to draw many conclusions
> at all due to
> bias.
> 
> On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov <ip...@yahoo.com>
> wrote:
> 
> > Robert, Joaquin,
> >
> > Sorry, I made an error reporting the results. 
> The preliminary improvement
> > is around 21% (it's a reduced collection).  I
> will have to run another test
> > to get the final numbers on the complete collection.
> >
> > We are planning to also apply the stemming. 
> Right now we are trying to
> > isolate each improvement experiment.
> >
> > Thanks,
> >
> > Ivan
> >
> >
> >
> > --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com>
> wrote:
> >
> > > From: Robert Muir <rc...@gmail.com>
> > > Subject: Re: BM25 Scoring Patch
> > > To: java-user@lucene.apache.org
> > > Date: Tuesday, February 16, 2010, 1:14 PM
> > > Ivan just a little more food for
> > > thought to help you with this:
> > >
> > > I'm glad you got improved results, yet I stand by
> my
> > > original statement of
> > > 'be careful' interpreting too much from one
> collection.
> > >
> > > eg. had you chosen TREC-4 instead of TREC-3, you
> would see
> > > different
> > > results, as vector-space with non-cosine doc
> length norm
> > > (LUCENE-2187)
> > > performed better than BM25 there:
> > > http://trec.nist.gov/pubs/trec4/overview.ps.gz
> > >
> > > in truth its hard to 'reuse' a pooled test
> collection to
> > > compare methods
> > > that were not part of the pool:
> > > http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
> > >
> > > This might help explain why you see such a
> difference in
> > > MAP score!
> > >
> > > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov
> <ip...@yahoo.com>
> > > wrote:
> > >
> > > > Joaquin, Robert,
> > > >
> > > > I followed Joaquin's recommendation and
> removed the
> > > call to set similarity
> > > > to BM25 explicitly (indexer,
> searcher).  The
> > > results showed 55% improvement
> > > > for the MAP score (0.141->0.219) over
> default
> > > similarity.
> > > >
> > > > Joaquin, how would setting the similarity to
> BM25
> > > explicitly make the score
> > > > worse?
> > > >
> > > > Thank you,
> > > >
> > > > Ivan
> > > >
> > > >
> > > >
> > > > --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com>
> > > wrote:
> > > >
> > > > > From: Robert Muir <rc...@gmail.com>
> > > > > Subject: Re: BM25 Scoring Patch
> > > > > To: java-user@lucene.apache.org
> > > > > Date: Tuesday, February 16, 2010, 11:36
> AM
> > > > > yes Ivan, if possible please report
> > > > > back any findings you can on the
> > > > > experiments you are doing!
> > > > >
> > > > > On Tue, Feb 16, 2010 at 11:22 AM,
> Joaquin Perez
> > > Iglesias
> > > > > <
> > > > > joaquin.perez@lsi.uned.es>
> > > > > wrote:
> > > > >
> > > > > > Hi Ivan,
> > > > > >
> > > > > > You shouldn't set the
> BM25Similarity for
> > > indexing or
> > > > > searching.
> > > > > > Please try removing the lines:
> > > > >
> >   writer.setSimilarity(new
> > > > > BM25Similarity());
> > > > >
> > >
> >   searcher.setSimilarity(sim);
> > > > > >
> > > > > > Please let us/me know if you
> improve your
> > > results with
> > > > > these changes.
> > > > > >
> > > > > >
> > > > > > Robert Muir escribió:
> > > > > >
> > > > > >  Hi Ivan, I've seen many
> cases where
> > > BM25
> > > > > performs worse than Lucene's
> > > > > >> default Similarity. Perhaps
> this is just
> > > another
> > > > > one?
> > > > > >>
> > > > > >> Again while I have not worked
> with this
> > > particular
> > > > > collection, I looked at
> > > > > >> the statistics and noted that
> its
> > > composed of
> > > > > several 'sub-collections':
> > > > > >> for
> > > > > >> example the PAT documents on
> disk 3 have
> > > an
> > > > > average doc length of 3543,
> > > > > >> but
> > > > > >> the AP documents on disk 1
> have an avg
> > > doc length
> > > > > of 353.
> > > > > >>
> > > > > >> I have found on other
> collections that
> > > any
> > > > > advantages of BM25's document
> > > > > >> length normalization fall
> apart when
> > > 'average
> > > > > document length' doesn't
> > > > > >> make
> > > > > >> a whole lot of sense (cases
> like this).
> > > > > >>
> > > > > >> For this same reason, I've
> only found a
> > > few
> > > > > collections where BM25's doc
> > > > > >> length normalization is
> really
> > > significantly
> > > > > better than Lucene's.
> > > > > >>
> > > > > >> In my opinion, the results on
> a
> > > particular test
> > > > > collection or 2 have
> > > > > >> perhaps
> > > > > >> been taken too far and created
> a myth
> > > that BM25 is
> > > > > always superior to
> > > > > >> Lucene's scoring... this is
> not true!
> > > > > >>
> > > > > >> On Tue, Feb 16, 2010 at 9:46
> AM, Ivan
> > > Provalov
> > > > > <ip...@yahoo.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>  I applied the Lucene
> patch
> > > mentioned in
> > > > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > > > ran the MAP
> > > > > >>> numbers
> > > > > >>> on TREC-3 collection using
> topics
> > > > > 151-200.  I am not getting worse
> > > > > >>> results
> > > > > >>> comparing to Lucene
> > > DefaultSimilarity.  I
> > > > > suspect, I am not using it
> > > > > >>> correctly.  I have
> single
> > > field
> > > > > documents.  This is the process I
> use:
> > > > > >>>
> > > > > >>> 1. During the indexing, I
> am setting
> > > the
> > > > > similarity to BM25 as such:
> > > > > >>>
> > > > > >>> IndexWriter writer = new
> > > IndexWriter(dir, new
> > > > > StandardAnalyzer(
> > > > > >>>
> > > > >    Version.LUCENE_CURRENT),
> true,
> > > > > >>>
> > > > >
> > > IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > >>> writer.setSimilarity(new
> > > BM25Similarity());
> > > > > >>>
> > > > > >>> 2. During the
> Precision/Recall
> > > measurements, I
> > > > > am using a
> > > > > >>> SimpleBM25QQParser
> extension I added
> > > to the
> > > > > benchmark:
> > > > > >>>
> > > > > >>> QualityQueryParser
> qqParser = new
> > > > > SimpleBM25QQParser("title", "TEXT");
> > > > > >>>
> > > > > >>>
> > > > > >>> 3. Here is the parser code
> (I set an
> > > avg doc
> > > > > length here):
> > > > > >>>
> > > > > >>> public Query
> parse(QualityQuery qq)
> > > throws
> > > > > ParseException {
> > > > >
> > >
> >>>   BM25Parameters.setAverageLength(indexField,
> > > > > 798.30f);//avg doc length
> > > > >
> > >
> >>>   BM25Parameters.setB(0.5f);//tried
> > > > > default values
> > > > >
> > >
> >>>   BM25Parameters.setK1(2f);
> > > > > >>>   return
> query = new
> > > > > BM25BooleanQuery(qq.getValue(qqName),
> > > indexField,
> > > > > >>> new
> > > > > >>>
> > > StandardAnalyzer(Version.LUCENE_CURRENT));
> > > > > >>> }
> > > > > >>>
> > > > > >>> 4. The searcher is using
> BM25
> > > similarity:
> > > > > >>>
> > > > > >>> Searcher searcher = new
> > > IndexSearcher(dir,
> > > > > true);
> > > > > >>>
> searcher.setSimilarity(sim);
> > > > > >>>
> > > > > >>> Am I missing some
> steps?  Does
> > > anyone
> > > > > have experience with this code?
> > > > > >>>
> > > > > >>> Thanks,
> > > > > >>>
> > > > > >>> Ivan
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > >
> > >
> ---------------------------------------------------------------------
> > > > > >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > >>> For additional commands,
> e-mail:
> > java-user-help@lucene.apache.org
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > > --
> > > > > >
> > > > >
> > >
> -----------------------------------------------------------
> > > > > > Joaquín Pérez Iglesias
> > > > > > Dpto. Lenguajes y Sistemas
> Informáticos
> > > > > > E.T.S.I. Informática (UNED)
> > > > > > Ciudad Universitaria
> > > > > > C/ Juan del Rosal nº 16
> > > > > > 28040 Madrid - Spain
> > > > > > Phone. +34 91 398 89 19
> > > > > > Fax    +34 91 398 65 35
> > > > > > Office  2.11
> > > > > > Email: joaquin.perez@lsi.uned.es
> > > > > > web:   http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/><
> > http://nlp.uned.es/%7Ejperezi/> <
> > > > http://nlp.uned.es/%7Ejperezi/>
> > > > > >
> > > > >
> > >
> -----------------------------------------------------------
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > >
> ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
> >
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> 
> -- 
> Robert Muir
> rcmuir@gmail.com
> 


      



Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Ivan, ok. It would be cool if you could list the MAP and bpref for the
different approaches you try (default Lucene, lnb.ltc, bm25), with and
without stemming.

As you reported previously, you got a 24% improvement with lnb.btc (right?). I
am guessing that we won't be able to draw many conclusions at all due to
bias.
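
For anyone reproducing these numbers: MAP is just the mean over topics of average precision, where AP averages precision@k at each rank holding a relevant document.  A minimal AP computation over a ranked result list (the relevance flags below are illustrative only):

```java
public class AveragePrecision {
    // AP = sum of precision@k at each rank k where a relevant doc appears,
    // divided by the total number of relevant docs for the topic.
    static double averagePrecision(boolean[] rankedIsRelevant, int totalRelevant) {
        double sum = 0;
        int hits = 0;
        for (int k = 0; k < rankedIsRelevant.length; k++) {
            if (rankedIsRelevant[k]) {
                hits++;
                sum += (double) hits / (k + 1);
            }
        }
        return totalRelevant == 0 ? 0 : sum / totalRelevant;
    }

    public static void main(String[] args) {
        // Relevant docs retrieved at ranks 1 and 3, 2 relevant docs total:
        // AP = (1/1 + 2/3) / 2 = 0.8333...
        boolean[] run = {true, false, true, false};
        System.out.println(averagePrecision(run, 2));
    }
}
```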

On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov <ip...@yahoo.com> wrote:

> Robert, Joaquin,
>
> Sorry, I made an error reporting the results.  The preliminary improvement
> is around 21% (it's a reduced collection).  I will have to run another test
> to get the final numbers on the complete collection.
>
> We are planning to also apply the stemming.  Right now we are trying to
> isolate each improvement experiment.
>
> Thanks,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>
> > From: Robert Muir <rc...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 1:14 PM
> > Ivan just a little more food for
> > thought to help you with this:
> >
> > I'm glad you got improved results, yet I stand by my
> > original statement of
> > 'be careful' interpreting too much from one collection.
> >
> > eg. had you chosen TREC-4 instead of TREC-3, you would see
> > different
> > results, as vector-space with non-cosine doc length norm
> > (LUCENE-2187)
> > performed better than BM25 there:
> > http://trec.nist.gov/pubs/trec4/overview.ps.gz
> >
> > in truth its hard to 'reuse' a pooled test collection to
> > compare methods
> > that were not part of the pool:
> > http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
> >
> > This might help explain why you see such a difference in
> > MAP score!
> >
> > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <ip...@yahoo.com>
> > wrote:
> >
> > > Joaquin, Robert,
> > >
> > > I followed Joaquin's recommendation and removed the
> > call to set similarity
> > > to BM25 explicitly (indexer, searcher).  The
> > results showed 55% improvement
> > > for the MAP score (0.141->0.219) over default
> > similarity.
> > >
> > > Joaquin, how would setting the similarity to BM25
> > explicitly make the score
> > > worse?
> > >
> > > Thank you,
> > >
> > > Ivan
> > >
> > >
> > >
> > > --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com>
> > wrote:
> > >
> > > > From: Robert Muir <rc...@gmail.com>
> > > > Subject: Re: BM25 Scoring Patch
> > > > To: java-user@lucene.apache.org
> > > > Date: Tuesday, February 16, 2010, 11:36 AM
> > > > yes Ivan, if possible please report
> > > > back any findings you can on the
> > > > experiments you are doing!
> > > >
> > > > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez
> > Iglesias
> > > > <
> > > > joaquin.perez@lsi.uned.es>
> > > > wrote:
> > > >
> > > > > Hi Ivan,
> > > > >
> > > > > You shouldn't set the BM25Similarity for
> > indexing or
> > > > searching.
> > > > > Please try removing the lines:
> > > > >   writer.setSimilarity(new
> > > > BM25Similarity());
> > > >
> > >   searcher.setSimilarity(sim);
> > > > >
> > > > > Please let us/me know if you improve your
> > results with
> > > > these changes.
> > > > >
> > > > >
> > > > > Robert Muir escribió:
> > > > >
> > > > >  Hi Ivan, I've seen many cases where
> > BM25
> > > > performs worse than Lucene's
> > > > >> default Similarity. Perhaps this is just
> > another
> > > > one?
> > > > >>
> > > > >> Again while I have not worked with this
> > particular
> > > > collection, I looked at
> > > > >> the statistics and noted that its
> > composed of
> > > > several 'sub-collections':
> > > > >> for
> > > > >> example the PAT documents on disk 3 have
> > an
> > > > average doc length of 3543,
> > > > >> but
> > > > >> the AP documents on disk 1 have an avg
> > doc length
> > > > of 353.
> > > > >>
> > > > >> I have found on other collections that
> > any
> > > > advantages of BM25's document
> > > > >> length normalization fall apart when
> > 'average
> > > > document length' doesn't
> > > > >> make
> > > > >> a whole lot of sense (cases like this).
> > > > >>
> > > > >> For this same reason, I've only found a
> > few
> > > > collections where BM25's doc
> > > > >> length normalization is really
> > significantly
> > > > better than Lucene's.
> > > > >>
> > > > >> In my opinion, the results on a
> > particular test
> > > > collection or 2 have
> > > > >> perhaps
> > > > >> been taken too far and created a myth
> > that BM25 is
> > > > always superior to
> > > > >> Lucene's scoring... this is not true!
> > > > >>
> > > > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan
> > Provalov
> > > > <ip...@yahoo.com>
> > > > >> wrote:
> > > > >>
> > > > >>  I applied the Lucene patch
> > mentioned in
> > > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > > ran the MAP
> > > > >>> numbers
> > > > >>> on TREC-3 collection using topics
> > > > 151-200.  I am not getting worse
> > > > >>> results
> > > > >>> comparing to Lucene
> > DefaultSimilarity.  I
> > > > suspect, I am not using it
> > > > >>> correctly.  I have single
> > field
> > > > documents.  This is the process I use:
> > > > >>>
> > > > >>> 1. During the indexing, I am setting
> > the
> > > > similarity to BM25 as such:
> > > > >>>
> > > > >>> IndexWriter writer = new
> > IndexWriter(dir, new
> > > > StandardAnalyzer(
> > > > >>>
> > > >    Version.LUCENE_CURRENT), true,
> > > > >>>
> > > >
> > IndexWriter.MaxFieldLength.UNLIMITED);
> > > > >>> writer.setSimilarity(new
> > BM25Similarity());
> > > > >>>
> > > > >>> 2. During the Precision/Recall
> > measurements, I
> > > > am using a
> > > > >>> SimpleBM25QQParser extension I added
> > to the
> > > > benchmark:
> > > > >>>
> > > > >>> QualityQueryParser qqParser = new
> > > > SimpleBM25QQParser("title", "TEXT");
> > > > >>>
> > > > >>>
> > > > >>> 3. Here is the parser code (I set an
> > avg doc
> > > > length here):
> > > > >>>
> > > > >>> public Query parse(QualityQuery qq)
> > throws
> > > > ParseException {
> > > >
> > >>>   BM25Parameters.setAverageLength(indexField,
> > > > 798.30f);//avg doc length
> > > >
> > >>>   BM25Parameters.setB(0.5f);//tried
> > > > default values
> > > >
> > >>>   BM25Parameters.setK1(2f);
> > > > >>>   return query = new
> > > > BM25BooleanQuery(qq.getValue(qqName),
> > indexField,
> > > > >>> new
> > > > >>>
> > StandardAnalyzer(Version.LUCENE_CURRENT));
> > > > >>> }
> > > > >>>
> > > > >>> 4. The searcher is using BM25
> > similarity:
> > > > >>>
> > > > >>> Searcher searcher = new
> > IndexSearcher(dir,
> > > > true);
> > > > >>> searcher.setSimilarity(sim);
> > > > >>>
> > > > >>> Am I missing some steps?  Does
> > anyone
> > > > have experience with this code?
> > > > >>>
> > > > >>> Thanks,
> > > > >>>
> > > > >>> Ivan
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > >
> > ---------------------------------------------------------------------
> > > > >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > >>> For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >>
> > > > > --
> > > > >
> > > >
> > -----------------------------------------------------------
> > > > > Joaquín Pérez Iglesias
> > > > > Dpto. Lenguajes y Sistemas Informáticos
> > > > > E.T.S.I. Informática (UNED)
> > > > > Ciudad Universitaria
> > > > > C/ Juan del Rosal nº 16
> > > > > 28040 Madrid - Spain
> > > > > Phone. +34 91 398 89 19
> > > > > Fax    +34 91 398 65 35
> > > > > Office  2.11
> > > > > Email: joaquin.perez@lsi.uned.es
> > > > > web:   http://nlp.uned.es/~jperezi/<http://nlp.uned.es/%7Ejperezi/><
> http://nlp.uned.es/%7Ejperezi/> <
> > > http://nlp.uned.es/%7Ejperezi/>
> > > > >
> > > >
> > -----------------------------------------------------------
> > > > >
> > > > >
> > > > >
> > > >
> > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> > >
> > >
> > >
> > >
> > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Ivan Provalov <ip...@yahoo.com>.
Robert, Joaquin,

Sorry, I made an error reporting the results.  The preliminary improvement is around 21% (it's a reduced collection).  I will have to run another test to get the final numbers on the complete collection.  

We are also planning to apply stemming.  Right now we are trying to isolate each improvement in a separate experiment.

Thanks,

Ivan
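
For reference, the k1 and b parameters (and the average document length set via
BM25Parameters) enter the score as follows. This is a standalone sketch of the
per-term BM25 weight for illustration only, not code from the LUCENE-2091 patch:

```java
// Minimal, self-contained sketch of the per-term BM25 weight.
// tf: term frequency in the document; docLen: document length in terms;
// avgDocLen: collection average (e.g. the 798.30 used in this thread);
// df: document frequency of the term; numDocs: collection size.
public class Bm25Sketch {
    // classic probabilistic idf used by BM25
    static double idf(int numDocs, int df) {
        return Math.log((numDocs - df + 0.5) / (df + 0.5));
    }

    // k1 controls tf saturation; b controls length-normalization strength
    static double weight(int tf, int docLen, double avgDocLen,
                         int df, int numDocs, double k1, double b) {
        double norm = k1 * ((1 - b) + b * (docLen / avgDocLen));
        return idf(numDocs, df) * (tf * (k1 + 1)) / (tf + norm);
    }

    public static void main(String[] args) {
        // same tf: the longer-than-average document scores lower
        System.out.println(weight(3, 1600, 798.30, 1000, 500000, 2.0, 0.5));
        System.out.println(weight(3, 400, 798.30, 1000, 500000, 2.0, 0.5));
    }
}
```

With b = 0.5 and k1 = 2.0 as in the parser snippet earlier in the thread, a
document longer than the 798.30 average is penalized for the same raw term
frequency, which is why getting the average document length right matters so
much on mixed collections.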



--- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:

> From: Robert Muir <rc...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Tuesday, February 16, 2010, 1:14 PM
> Ivan just a little more food for
> thought to help you with this:
> 
> I'm glad you got improved results, yet I stand by my
> original statement of
> 'be careful' interpreting too much from one collection.
> 
> eg. had you chosen TREC-4 instead of TREC-3, you would see
> different
> results, as vector-space with non-cosine doc length norm
> (LUCENE-2187)
> performed better than BM25 there:
> http://trec.nist.gov/pubs/trec4/overview.ps.gz
> 
> in truth its hard to 'reuse' a pooled test collection to
> compare methods
> that were not part of the pool:
> http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
> 
> This might help explain why you see such a difference in
> MAP score!
> 
> On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <ip...@yahoo.com>
> wrote:
> 
> > Joaquin, Robert,
> >
> > I followed Joaquin's recommendation and removed the
> call to set similarity
> > to BM25 explicitly (indexer, searcher).  The
> results showed 55% improvement
> > for the MAP score (0.141->0.219) over default
> similarity.
> >
> > Joaquin, how would setting the similarity to BM25
> explicitly make the score
> > worse?
> >
> > Thank you,
> >
> > Ivan
> >
> >
> >
> > --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com>
> wrote:
> >
> > > From: Robert Muir <rc...@gmail.com>
> > > Subject: Re: BM25 Scoring Patch
> > > To: java-user@lucene.apache.org
> > > Date: Tuesday, February 16, 2010, 11:36 AM
> > > yes Ivan, if possible please report
> > > back any findings you can on the
> > > experiments you are doing!
> > >
> > > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez
> Iglesias
> > > <
> > > joaquin.perez@lsi.uned.es>
> > > wrote:
> > >
> > > > Hi Ivan,
> > > >
> > > > You shouldn't set the BM25Similarity for
> indexing or
> > > searching.
> > > > Please try removing the lines:
> > > >   writer.setSimilarity(new
> > > BM25Similarity());
> > >
> >   searcher.setSimilarity(sim);
> > > >
> > > > Please let us/me know if you improve your
> results with
> > > these changes.
> > > >
> > > >
> > > > > Robert Muir wrote:
> > > >
> > > >  Hi Ivan, I've seen many cases where
> BM25
> > > performs worse than Lucene's
> > > >> default Similarity. Perhaps this is just
> another
> > > one?
> > > >>
> > > >> Again while I have not worked with this
> particular
> > > collection, I looked at
> > > >> the statistics and noted that its
> composed of
> > > several 'sub-collections':
> > > >> for
> > > >> example the PAT documents on disk 3 have
> an
> > > average doc length of 3543,
> > > >> but
> > > >> the AP documents on disk 1 have an avg
> doc length
> > > of 353.
> > > >>
> > > >> I have found on other collections that
> any
> > > advantages of BM25's document
> > > >> length normalization fall apart when
> 'average
> > > document length' doesn't
> > > >> make
> > > >> a whole lot of sense (cases like this).
> > > >>
> > > >> For this same reason, I've only found a
> few
> > > collections where BM25's doc
> > > >> length normalization is really
> significantly
> > > better than Lucene's.
> > > >>
> > > >> In my opinion, the results on a
> particular test
> > > collection or 2 have
> > > >> perhaps
> > > >> been taken too far and created a myth
> that BM25 is
> > > always superior to
> > > >> Lucene's scoring... this is not true!
> > > >>
> > > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan
> Provalov
> > > <ip...@yahoo.com>
> > > >> wrote:
> > > >>
> > > >>  I applied the Lucene patch
> mentioned in
> > > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > ran the MAP
> > > >>> numbers
> > > >>> on TREC-3 collection using topics
> > > 151-200.  I am not getting worse
> > > >>> results
> > > >>> comparing to Lucene
> DefaultSimilarity.  I
> > > suspect, I am not using it
> > > >>> correctly.  I have single
> field
> > > documents.  This is the process I use:
> > > >>>
> > > >>> 1. During the indexing, I am setting
> the
> > > similarity to BM25 as such:
> > > >>>
> > > >>> IndexWriter writer = new
> IndexWriter(dir, new
> > > StandardAnalyzer(
> > > >>>
> > >    Version.LUCENE_CURRENT), true,
> > > >>>
> > >   
> IndexWriter.MaxFieldLength.UNLIMITED);
> > > >>> writer.setSimilarity(new
> BM25Similarity());
> > > >>>
> > > >>> 2. During the Precision/Recall
> measurements, I
> > > am using a
> > > >>> SimpleBM25QQParser extension I added
> to the
> > > benchmark:
> > > >>>
> > > >>> QualityQueryParser qqParser = new
> > > SimpleBM25QQParser("title", "TEXT");
> > > >>>
> > > >>>
> > > >>> 3. Here is the parser code (I set an
> avg doc
> > > length here):
> > > >>>
> > > >>> public Query parse(QualityQuery qq)
> throws
> > > ParseException {
> > >
> >>>   BM25Parameters.setAverageLength(indexField,
> > > 798.30f);//avg doc length
> > >
> >>>   BM25Parameters.setB(0.5f);//tried
> > > default values
> > >
> >>>   BM25Parameters.setK1(2f);
> > > >>>   return query = new
> > > BM25BooleanQuery(qq.getValue(qqName),
> indexField,
> > > >>> new
> > > >>>
> StandardAnalyzer(Version.LUCENE_CURRENT));
> > > >>> }
> > > >>>
> > > >>> 4. The searcher is using BM25
> similarity:
> > > >>>
> > > >>> Searcher searcher = new
> IndexSearcher(dir,
> > > true);
> > > >>> searcher.setSimilarity(sim);
> > > >>>
> > > >>> Am I missing some steps?  Does
> anyone
> > > have experience with this code?
> > > >>>
> > > >>> Thanks,
> > > >>>
> > > >>> Ivan
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > >
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > > >>
> > > > --
> > > >
> > >
> -----------------------------------------------------------
> > > > Joaquín Pérez Iglesias
> > > > Dpto. Lenguajes y Sistemas Informáticos
> > > > E.T.S.I. Informática (UNED)
> > > > Ciudad Universitaria
> > > > C/ Juan del Rosal nº 16
> > > > 28040 Madrid - Spain
> > > > Phone. +34 91 398 89 19
> > > > Fax    +34 91 398 65 35
> > > > Office  2.11
> > > > Email: joaquin.perez@lsi.uned.es
> > > > web:   http://nlp.uned.es/~jperezi/
> > > >
> > >
> -----------------------------------------------------------
> > > >
> > > >
> > > >
> > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
> >
> >
> >
> >
> >
> >
> 
> 
> -- 
> Robert Muir
> rcmuir@gmail.com
> 


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Ivan, just a little more food for thought to help you with this:

I'm glad you got improved results, yet I stand by my original statement:
be careful about reading too much into a single collection.

E.g., had you chosen TREC-4 instead of TREC-3, you would have seen different
results, as vector-space with a non-cosine doc length norm (LUCENE-2187)
performed better than BM25 there:
http://trec.nist.gov/pubs/trec4/overview.ps.gz

In truth it's hard to 'reuse' a pooled test collection to compare methods
that were not part of the pool:
http://www.ir.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf

This might help explain why you see such a difference in MAP score!

On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <ip...@yahoo.com> wrote:

> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set similarity
> to BM25 explicitly (indexer, searcher).  The results showed 55% improvement
> for the MAP score (0.141->0.219) over default similarity.
>
> Joaquin, how would setting the similarity to BM25 explicitly make the score
> worse?
>
> Thank you,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>
> > From: Robert Muir <rc...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 11:36 AM
> > yes Ivan, if possible please report
> > back any findings you can on the
> > experiments you are doing!
> >
> > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > <
> > joaquin.perez@lsi.uned.es>
> > wrote:
> >
> > > Hi Ivan,
> > >
> > > You shouldn't set the BM25Similarity for indexing or
> > searching.
> > > Please try removing the lines:
> > >   writer.setSimilarity(new
> > BM25Similarity());
> > >   searcher.setSimilarity(sim);
> > >
> > > Please let us/me know if you improve your results with
> > these changes.
> > >
> > >
> > > Robert Muir wrote:
> > >
> > >  Hi Ivan, I've seen many cases where BM25
> > performs worse than Lucene's
> > >> default Similarity. Perhaps this is just another
> > one?
> > >>
> > >> Again while I have not worked with this particular
> > collection, I looked at
> > >> the statistics and noted that its composed of
> > several 'sub-collections':
> > >> for
> > >> example the PAT documents on disk 3 have an
> > average doc length of 3543,
> > >> but
> > >> the AP documents on disk 1 have an avg doc length
> > of 353.
> > >>
> > >> I have found on other collections that any
> > advantages of BM25's document
> > >> length normalization fall apart when 'average
> > document length' doesn't
> > >> make
> > >> a whole lot of sense (cases like this).
> > >>
> > >> For this same reason, I've only found a few
> > collections where BM25's doc
> > >> length normalization is really significantly
> > better than Lucene's.
> > >>
> > >> In my opinion, the results on a particular test
> > collection or 2 have
> > >> perhaps
> > >> been taken too far and created a myth that BM25 is
> > always superior to
> > >> Lucene's scoring... this is not true!
> > >>
> > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> > <ip...@yahoo.com>
> > >> wrote:
> > >>
> > >>  I applied the Lucene patch mentioned in
> > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > ran the MAP
> > >>> numbers
> > >>> on TREC-3 collection using topics
> > 151-200.  I am not getting worse
> > >>> results
> > >>> comparing to Lucene DefaultSimilarity.  I
> > suspect, I am not using it
> > >>> correctly.  I have single field
> > documents.  This is the process I use:
> > >>>
> > >>> 1. During the indexing, I am setting the
> > similarity to BM25 as such:
> > >>>
> > >>> IndexWriter writer = new IndexWriter(dir, new
> > StandardAnalyzer(
> > >>>
> >    Version.LUCENE_CURRENT), true,
> > >>>
> >    IndexWriter.MaxFieldLength.UNLIMITED);
> > >>> writer.setSimilarity(new BM25Similarity());
> > >>>
> > >>> 2. During the Precision/Recall measurements, I
> > am using a
> > >>> SimpleBM25QQParser extension I added to the
> > benchmark:
> > >>>
> > >>> QualityQueryParser qqParser = new
> > SimpleBM25QQParser("title", "TEXT");
> > >>>
> > >>>
> > >>> 3. Here is the parser code (I set an avg doc
> > length here):
> > >>>
> > >>> public Query parse(QualityQuery qq) throws
> > ParseException {
> > >>>   BM25Parameters.setAverageLength(indexField,
> > 798.30f);//avg doc length
> > >>>   BM25Parameters.setB(0.5f);//tried
> > default values
> > >>>   BM25Parameters.setK1(2f);
> > >>>   return query = new
> > BM25BooleanQuery(qq.getValue(qqName), indexField,
> > >>> new
> > >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> > >>> }
> > >>>
> > >>> 4. The searcher is using BM25 similarity:
> > >>>
> > >>> Searcher searcher = new IndexSearcher(dir,
> > true);
> > >>> searcher.setSimilarity(sim);
> > >>>
> > >>> Am I missing some steps?  Does anyone
> > have experience with this code?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Ivan
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > > --
> > >
> > -----------------------------------------------------------
> > > Joaquín Pérez Iglesias
> > > Dpto. Lenguajes y Sistemas Informáticos
> > > E.T.S.I. Informática (UNED)
> > > Ciudad Universitaria
> > > C/ Juan del Rosal nº 16
> > > 28040 Madrid - Spain
> > > Phone. +34 91 398 89 19
> > > Fax    +34 91 398 65 35
> > > Office  2.11
> > > Email: joaquin.perez@lsi.uned.es
> > > web:   http://nlp.uned.es/~jperezi/
> > >
> > -----------------------------------------------------------
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Cool! I saw you were using StandardAnalyzer too; maybe you want to try
stemming as well (this analyzer does not do stemming)... it usually helps.
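
Since StandardAnalyzer came up: the point of stemming is that query and
document terms get conflated to a common form, so morphological variants
match at search time. A deliberately crude suffix stripper (a toy stand-in
for illustration, not Lucene's PorterStemFilter) shows the effect:

```java
// Toy illustration of term conflation via stemming.
// NOT a real stemmer: it only strips a trailing plural 's'.
public class StemSketch {
    static String stem(String term) {
        // leave short words and words ending in "ss" (e.g. "class") alone
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        // "patents" and "patent" now map to the same index term
        System.out.println(stem("patents"));
        System.out.println(stem("patent"));
    }
}
```

In a real setup you would add a stemming filter such as PorterStemFilter to
the analyzer chain used for both indexing and query parsing, so both sides
conflate terms the same way.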

On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov <ip...@yahoo.com> wrote:

> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set similarity
> to BM25 explicitly (indexer, searcher).  The results showed 55% improvement
> for the MAP score (0.141->0.219) over default similarity.
>
> Joaquin, how would setting the similarity to BM25 explicitly make the score
> worse?
>
> Thank you,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>
> > From: Robert Muir <rc...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Tuesday, February 16, 2010, 11:36 AM
> > yes Ivan, if possible please report
> > back any findings you can on the
> > experiments you are doing!
> >
> > On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > <
> > joaquin.perez@lsi.uned.es>
> > wrote:
> >
> > > Hi Ivan,
> > >
> > > You shouldn't set the BM25Similarity for indexing or
> > searching.
> > > Please try removing the lines:
> > >   writer.setSimilarity(new
> > BM25Similarity());
> > >   searcher.setSimilarity(sim);
> > >
> > > Please let us/me know if you improve your results with
> > these changes.
> > >
> > >
> > > Robert Muir wrote:
> > >
> > >  Hi Ivan, I've seen many cases where BM25
> > performs worse than Lucene's
> > >> default Similarity. Perhaps this is just another
> > one?
> > >>
> > >> Again while I have not worked with this particular
> > collection, I looked at
> > >> the statistics and noted that its composed of
> > several 'sub-collections':
> > >> for
> > >> example the PAT documents on disk 3 have an
> > average doc length of 3543,
> > >> but
> > >> the AP documents on disk 1 have an avg doc length
> > of 353.
> > >>
> > >> I have found on other collections that any
> > advantages of BM25's document
> > >> length normalization fall apart when 'average
> > document length' doesn't
> > >> make
> > >> a whole lot of sense (cases like this).
> > >>
> > >> For this same reason, I've only found a few
> > collections where BM25's doc
> > >> length normalization is really significantly
> > better than Lucene's.
> > >>
> > >> In my opinion, the results on a particular test
> > collection or 2 have
> > >> perhaps
> > >> been taken too far and created a myth that BM25 is
> > always superior to
> > >> Lucene's scoring... this is not true!
> > >>
> > >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> > <ip...@yahoo.com>
> > >> wrote:
> > >>
> > >>  I applied the Lucene patch mentioned in
> > >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > ran the MAP
> > >>> numbers
> > >>> on TREC-3 collection using topics
> > 151-200.  I am not getting worse
> > >>> results
> > >>> comparing to Lucene DefaultSimilarity.  I
> > suspect, I am not using it
> > >>> correctly.  I have single field
> > documents.  This is the process I use:
> > >>>
> > >>> 1. During the indexing, I am setting the
> > similarity to BM25 as such:
> > >>>
> > >>> IndexWriter writer = new IndexWriter(dir, new
> > StandardAnalyzer(
> > >>>
> >    Version.LUCENE_CURRENT), true,
> > >>>
> >    IndexWriter.MaxFieldLength.UNLIMITED);
> > >>> writer.setSimilarity(new BM25Similarity());
> > >>>
> > >>> 2. During the Precision/Recall measurements, I
> > am using a
> > >>> SimpleBM25QQParser extension I added to the
> > benchmark:
> > >>>
> > >>> QualityQueryParser qqParser = new
> > SimpleBM25QQParser("title", "TEXT");
> > >>>
> > >>>
> > >>> 3. Here is the parser code (I set an avg doc
> > length here):
> > >>>
> > >>> public Query parse(QualityQuery qq) throws
> > ParseException {
> > >>>   BM25Parameters.setAverageLength(indexField,
> > 798.30f);//avg doc length
> > >>>   BM25Parameters.setB(0.5f);//tried
> > default values
> > >>>   BM25Parameters.setK1(2f);
> > >>>   return query = new
> > BM25BooleanQuery(qq.getValue(qqName), indexField,
> > >>> new
> > >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> > >>> }
> > >>>
> > >>> 4. The searcher is using BM25 similarity:
> > >>>
> > >>> Searcher searcher = new IndexSearcher(dir,
> > true);
> > >>> searcher.setSimilarity(sim);
> > >>>
> > >>> Am I missing some steps?  Does anyone
> > have experience with this code?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Ivan
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > > --
> > >
> > -----------------------------------------------------------
> > > Joaquín Pérez Iglesias
> > > Dpto. Lenguajes y Sistemas Informáticos
> > > E.T.S.I. Informática (UNED)
> > > Ciudad Universitaria
> > > C/ Juan del Rosal nº 16
> > > 28040 Madrid - Spain
> > > Phone. +34 91 398 89 19
> > > Fax    +34 91 398 65 35
> > > Office  2.11
> > > Email: joaquin.perez@lsi.uned.es
> > > web:   http://nlp.uned.es/~jperezi/
> > >
> > -----------------------------------------------------------
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Right, that is basically lnu.ltc; we should support that model too.

I experimented with this for some work I was doing, and hacked it in with
Similarity by exposing these statistics to FieldInvertState and packing them
into the norm. Not ideal, since the norm is just a byte and already stores
the length norm too, but it somewhat works.
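
For readers unfamiliar with the SMART notation: the 'u' in lnu.ltc is pivoted
unique-term normalization. A standalone sketch of the Lnu document term
weight, assuming Singhal's pivoted-normalization formulation (illustrative
only, not code from LUCENE-2187):

```java
// Sketch of the Lnu document term weight (SMART notation):
//   L: (1 + log tf) / (1 + log avgTf)   -- dampened, average-relative tf
//   n: no idf on the document side
//   u: pivoted unique-term length normalization
public class LnuSketch {
    // tf >= 1; avgTf: average term frequency within the document (>= 1);
    // uniqueTerms: number of distinct terms in the document;
    // pivot: collection average of uniqueTerms; slope is typically ~0.2
    static double lnuWeight(int tf, double avgTf, int uniqueTerms,
                            double pivot, double slope) {
        double l = (1 + Math.log(tf)) / (1 + Math.log(avgTf));
        double u = (1 - slope) * pivot + slope * uniqueTerms;
        return l / u;
    }

    public static void main(String[] args) {
        // a document with more unique terms gets a larger normalizer
        System.out.println(lnuWeight(5, 2.0, 150, 100.0, 0.2));
        System.out.println(lnuWeight(5, 2.0, 300, 100.0, 0.2));
    }
}
```

Packing these statistics into Lucene's single-byte norm loses precision;
uniqueTerms and the average tf would ideally live as separate per-document
statistics.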

On Wed, Feb 17, 2010 at 11:15 AM, Ivan Provalov <ip...@yahoo.com> wrote:

> Another example of plugging in different score mechanism is getting average
> term frequency for the TF normalization described in IBM's
> http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
>
> We opened up the TermScorer class for that.
>
> Thanks,
>
> Ivan
>
> --- On Wed, 2/17/10, Robert Muir <rc...@gmail.com> wrote:
>
> > From: Robert Muir <rc...@gmail.com>
> > Subject: Re: BM25 Scoring Patch
> > To: java-user@lucene.apache.org
> > Date: Wednesday, February 17, 2010, 10:31 AM
> > Yuval, i apologize for not having an
> > intelligent response for your question
> > (if i did i would try to formulate it as a patch), but I
> > too would like for
> > it to be extremely easy... maybe we can iterate on the
> > patch.
> >
> > below is how i feel about it:
> >
> > i guess theoretically, the use of Similarity is how we
> > would implement a
> > pluggable scoring formula, i think already supported by
> > Solr. it would be
> > nice if BM25 could be just another Similarity, but i'm not
> > even sure thats
> > realistic in the near future.
> >
> > yet if we don't do the hard work up front to make it easy
> > to plug in things
> > like BM25, then no one will implement additional scoring
> > formulas for
> > Lucene, we currently make it terribly difficult to do
> > this.
> >
> > in the BM25 case we are just lucky, as Joaquin went thru a
> > lot of
> > work/jumped thru a lot of hoops to make it happen.
> >
> > On Wed, Feb 17, 2010 at 3:36 AM, Yuval Feinstein <yu...@answers.com>
> > wrote:
> >
> > > This is very interesting and much friendlier than a
> > flame war.
> > > My practical question for Robert is:
> > > How can we modify the BM25 patch so that it:
> > > a) Becomes part of Lucene contrib.
> > > b) Be easier to use (preventing mistakes  such as
> > Ivan's using the BM25
> > > similarity during indexing).
> > > c) Proceeds towards a pluggable scoring formula
> > (Ideally, we should have an
> > > IndexReader/IndexSearcher/IndexWriter
> > > constructor enabling specifying a scoring model
> > through an enum, with the
> > > default being, well, Lucene's default scoring model)?
> > > The easier it is to use, the more experiments people
> > can make, and see how
> > > it works for them.
> > > A future "marketing" step could be adding BM25 to
> > Solr, to further ease
> > > experimentation.
> > > TIA,
> > > Yuval
> > >
> > >
> > > -----Original Message-----
> > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > Sent: Tuesday, February 16, 2010 10:38 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: BM25 Scoring Patch
> > >
> > > Joaquin, I have a typical methodology where I don't
> > optimize any scoring
> > > params: be it BM25 params (I stick with your
> > defaults), or lnb.ltc params
> > > (i
> > > stick with default slope). When doing query expansion
> > i don't modify the
> > > defaults for MoreLikeThis either.
> > >
> > > I've found that changing these params can have a
> > significant difference in
> > > retrieval performance, which is interesting, but I'm
> > typically focused on
> > > text analysis (how is the text
> > indexed?/stemming/stopwords). I also feel
> > > that such things are corpus-specific, which i
> > generally try to avoid in my
> > > work...
> > >
> > > for example, in analysis work,  the text
> > collection often has a majority of
> > > text in a specific tense (i.e. news), so i don't at
> > all try to tune any
> > > part
> > > of analysis as I worry this would be
> > corpus-specific... I do the same with
> > > scoring.
> > >
> > > As far as why some models perform better than others
> > for certain languages,
> > > I think this is a million-dollar question. But my
> > intuition (I don't have
> > > references or anything to back this up), is that
> > probabilistic models
> > > outperform vector-space models when you are using
> > approaches like n-grams:
> > > you don't have nice stopwords lists, stemming,
> > decompounding etc.
> > >
> > > This is particularly interesting to me, as
> > probabilistic model + ngram is a
> > > very general multilingual approach that I would like
> > to have working well
> > > in
> > > Lucene, its also important as a "default" when we
> > don't have a nicely tuned
> > > analyzer available that will work well with a vector
> > space model. In my
> > > opinion, vector-space tends to fall apart without good
> > language support.
> > >
> > >
> > > On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ
> > IGLESIAS <
> > > joaquin.perez@lsi.uned.es>
> > wrote:
> > >
> > > > Ok,
> > > >
> > > > I'm not advocating the BM25 patch neither,
> > unfortunately BM25 was not my
> > > > idea :-))), and I'm sure that the implementation
> > can be improved.
> > > >
> > > > When you use the BM25 implementation, are you
> > optimising the parameters
> > > > specifically per collection? (It is a key factor
> > for improving BM25
> > > > performance).
> > > >
> > > > Why do you think that BM25 works better for
> > English than in other
> > > > languages (apart of experiments). What are your
> > intuitions?
> > > >
> > > > I dont't have too much experience on languages
> > moreover of Spanish and
> > > > English, and it sounds pretty interesting.
> > > >
> > > > Kind Regards.
> > > >
> > > > P.S: Maybe this is not a topic for this list???
> > > >
> > > >
> > > > > Joaquin, I don't see this as a flame war?
> > First of all I'd like to
> > > > > personally thank you for your excellent BM25
> > implementation!
> > > > >
> > > > > I think the selection of a retrieval model
> > depends highly on the
> > > > > language/indexing approach, i.e. if we were
> > talking East Asian
> > > languages
> > > > I
> > > > > think we want a probabilistic model: no
> > argument there!
> > > > >
> > > > > All i said was that it is a myth that BM25
> > is "always" better than
> > > > > Lucene's
> > > > > scoring model, it really depends on what you
> > are trying to do, how you
> > > > are
> > > > > indexing your text, properties of your
> > corpus, how your queries are
> > > > > running.
> > > > >
> > > > > I don't even want to come across as
> > advocating the lnb.ltc approach
> > > > > either,
> > > > > sure I wrote the patch, but this means
> > nothing. I only like it as its
> > > > > currently a simple integration into Lucene,
> > but long-term its best if
> > > we
> > > > > can
> > > > > support other models also!
> > > > >
> > > > > Finally I think there is something to be
> > said for Lucene's default
> > > > > retrieval
> > > > > model, which in my (non-english) findings
> > across the board isn't
> > > terrible
> > > > > at
> > > > > all... then again I am working with
> > languages where analysis is really
> > > > the
> > > > > thing holding Lucene back, not scoring.
> > > > >
> > > > > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN
> > PEREZ IGLESIAS <
> > > > > joaquin.perez@lsi.uned.es>
> > wrote:
> > > > >
> > > > >> Just some final comments (as I said I'm
> > not interested in flame wars),
> > > > >>
> > > > >> If I obtain better results there are not
> > problem with pooling
> > > otherwise
> > > > >> it
> > > > >> is biased.
> > > > >> The only important thing (in my opinion)
> > is that it cannot be said
> > > that
> > > > >> BM25 is a myth.
> > > > >> Yes, you are right there is not an only
> > ranking model that beats the
> > > > >> rest,
> > > > >> but there are models that generally show
> > a better performance in more
> > > > >> cases.
> > > > >>
> > > > >> About CLEF I have had the same
> > experience (VSM vs BM25) on Spanish and
> > > > >> English (WebCLEF) and Q&A
> > (ResPubliQA)
> > > > >>
> > > > >> Ivan checks the parameters (b and k1),
> > probably you can improve your
> > > > >> results. (that's the bad part of BM25).
> > > > >>
> > > > >> Finally we are just speaking of personal
> > experience, so obviously you
> > > > >> should use the best model for your data
> > and your own experience, on IR
> > > > >> there are not myths neither best ranking
> > models. If any of us is able
> > > to
> > > > >> find the “best”
> > ranking model, or is able to prove that
> > > any
> > > > >> state-of-the art is a myth he should
> > send these results to the SIGIR
> > > > >> conference.
> > > > >>
> > > > >> Ivan, Robert good luck with your
> > experiments, as I said the good part
> > > of
> > > > >> IR is that you can always make
> > experiments on your own.
> > > > >>
> > > > >> > I don't think its really a
> > competition, I think preferably we should
> > > > >> have
> > > > >> > the flexibility to change the
> > scoring model in lucene actually?
> > > > >> >
> > > > >> > I have found lots of cases where
> > VSM improves on BM25, but then
> > > again
> > > > >> I
> > > > >> > don't work with TREC stuff, as I
> > work with non-english collections.
> > > > >> >
> > > > >> > It doesn't contradict years of
> > research to say that VSM isn't a
> > > > >> > state-of-the-art model, besides the
> > TREC-4 results, there are CLEF
> > > > >> results
> > > > >> > where VSM models perform
> > competitively or exceed (Finnish, Russian,
> > > > >> etc)
> > > > >> > BM25/DFR/etc.
> > > > >> >
> > > > >> > It depends on the collection, there
> > isn't a 'best retrieval
> > > formula'.
> > > > >> >
> > > > >> > Note: I have no bias against BM-25,
> > but its definitely a myth to say
> > > > >> there
> > > > >> > is a single retrieval formula that
> > is the 'best' across the board.
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Feb 16, 2010 at 1:53 PM,
> > JOAQUIN PEREZ IGLESIAS <
> > > > >> > joaquin.perez@lsi.uned.es>
> > wrote:
> > > > >> >
> > > > >> >> By the way,
> > > > >> >>
> > > > >> >> I don't want to start a flame
> > war VSM vs BM25, but I really believe
> > > > >> that
> > > > >> >> I
> > > > >> >> have to express my opinion as
> > Robert has done. In my experience, I
> > > > >> have
> > > > >> >> never found a case where VSM
> > improves significantly BM25. Maybe you
> > > > >> can
> > > > >> >> find some cases under some very
> > specific collection
> > > characteristics,
> > > > >> (as
> > > > >> >> average length of 300 vs 3000)
> > or a bad usage of BM25 (not proper
> > > > >> >> parameters) where it can
> > happen.
> > > > >> >>
> > > > >> >> BM25 is not just only a
> > different way of length normalization, it
> > > is
> > > > >> >> based
> > > > >> >> strongly in the probabilistic
> > framework, and parametrises
> > > frequencies
> > > > >> >> and
> > > > >> >> length. This is probably the
> > most successful ranking model of the
> > > > >> last
> > > > >> >> years in Information
> > Retrieval.
> > > > >> >>
> > > > >> >> I have never read a paper where
> > VSM  improves any of the
> > > > >> >> state-of-the-art
> > > > >> >> ranking models (Language
> > Models, DFR, BM25,...),  although the VSM
> > > > >> with
> > > > >> >> pivoted normalisation length
> > can obtain nice results. This can be
> > > > >> proved
> > > > >> >> checking the last years of the
> > TREC competition.
> > > > >> >>
> > > > >> >> Honestly to say that is a myth
> > that BM25 improves VSM breaks the
> > > last
> > > > >> 10
> > > > >> >> or 15 years of research on
> > Information Retrieval, and I really
> > > > >> believe
> > > > >> >> that is not accurate.
> > > > >> >>
> > > > >> >> The good thing of Information
> > Retrieval is that you can always make
> > > > >> your
> > > > >> >> owns experiments and you can
> > use the experience of a lot of years
> > > of
> > > > >> >> research.
> > > > >> >>
> > > > >> >> PS: This opinion is based on
> > experiments on TREC and CLEF
> > > > >> collections,
> > > > >> >> obviously we can start a debate
> > about the suitability of this type
> > > of
> > > > >> >> experimentation (concept of
> > relevance, pooling, relevance
> > > > >> judgements),
> > > > >> >> but
> > > > >> >> this is a much more complex
> > topic and I believe is far from what we
> > > > >> are
> > > > >> >> dealing here.
> > > > >> >>
> > > > >> >> PS2: In relation with TREC4
> > Cornell used a pivoted length
> > > > >> normalisation
> > > > >> >> and they were applying
> > pseudo-relevance feedback, what honestly
> > > makes
> > > > >> >> much
> > > > >> >> more difficult the analysis of
> > the results. Obviously their results
> > > > >> were
> > > > >> >> part of the pool.
> > > > >> >>
> > > > >> >> Sorry for the huge mail :-))))
> > > > >> >>
> > > > >> >> > Hi Ivan,
> > > > >> >> >
> > > > >> >> > the problem is that
> > unfortunately BM25
> > > > >> >> > cannot be implemented
> > overwriting
> > > > >> >> > the Similarity interface.
> > Therefore BM25Similarity
> > > > >> >> > only computes the classic
> > probabilistic IDF (what is
> > > > >> >> > interesting only at search
> > time).
> > > > >> >> > If you set BM25Similarity
> > at indexing time
> > > > >> >> > some basic stats are not
> > stored
> > > > >> >> > correctly in the segments
> > (like docs length).
> > > > >> >> >
> > > > >> >> > When you use
> > BM25BooleanQuery this class
> > > > >> >> > will set automatically the
> > BM25Similarity for you,
> > > > >> >> > therefore you don't need
> > to do this explicitly.
> > > > >> >> >
> > > > >> >> > I tried to make this
> > implementation with the focus on
> > > > >> >> > not interfering on the
> > typical use of Lucene (so no changing
> > > > >> >> > DefaultSimilarity).
> > > > >> >> >
> > > > >> >> >> Joaquin, Robert,
> > > > >> >> >>
> > > > >> >> >> I followed Joaquin's
> > recommendation and removed the call to set
> > > > >> >> >> similarity
> > > > >> >> >> to BM25 explicitly
> > (indexer, searcher).  The results showed 55%
> > > > >> >> >> improvement for the
> > MAP score (0.141->0.219) over default
> > > > >> similarity.
> > > > >> >> >>
> > > > >> >> >> Joaquin, how would
> > setting the similarity to BM25 explicitly
> > > make
> > > > >> the
> > > > >> >> >> score worse?
> > > > >> >> >>
> > > > >> >> >> Thank you,
> > > > >> >> >>
> > > > >> >> >> Ivan
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> --- On Tue, 2/16/10,
> > Robert Muir <rc...@gmail.com>
> > wrote:
> > > > >> >> >>
> > > > >> >> >>> From: Robert Muir
> > <rc...@gmail.com>
> > > > >> >> >>> Subject: Re: BM25
> > Scoring Patch
> > > > >> >> >>> To: java-user@lucene.apache.org
> > > > >> >> >>> Date: Tuesday,
> > February 16, 2010, 11:36 AM
> > > > >> >> >>> yes Ivan, if
> > possible please report
> > > > >> >> >>> back any findings
> > you can on the
> > > > >> >> >>> experiments you
> > are doing!
> > > > >> >> >>>
> > > > >> >> >>> On Tue, Feb 16,
> > 2010 at 11:22 AM, Joaquin Perez Iglesias
> > > > >> >> >>> <
> > > > >> >> >>> joaquin.perez@lsi.uned.es>
> > > > >> >> >>> wrote:
> > > > >> >> >>>
> > > > >> >> >>> > Hi Ivan,
> > > > >> >> >>> >
> > > > >> >> >>> > You shouldn't
> > set the BM25Similarity for indexing or
> > > > >> >> >>> searching.
> > > > >> >> >>> > Please try
> > removing the lines:
> > > > >> >> >>>
> > >   writer.setSimilarity(new
> > > > >> >> >>>
> > BM25Similarity());
> > > > >> >> >>>
> > >   searcher.setSimilarity(sim);
> > > > >> >> >>> >
> > > > >> >> >>> > Please let
> > us/me know if you improve your results with
> > > > >> >> >>> these changes.
> > > > >> >> >>> >
> > > > >> >> >>> >
> > > > >> >> >>> > Robert Muir
> > escribió:
> > > > >> >> >>> >
> > > > >> >> >>> >  Hi
> > Ivan, I've seen many cases where BM25
> > > > >> >> >>> performs worse
> > than Lucene's
> > > > >> >> >>> >> default
> > Similarity. Perhaps this is just another
> > > > >> >> >>> one?
> > > > >> >> >>> >>
> > > > >> >> >>> >> Again
> > while I have not worked with this particular
> > > > >> >> >>> collection, I
> > looked at
> > > > >> >> >>> >> the
> > statistics and noted that its composed of
> > > > >> >> >>> several
> > 'sub-collections':
> > > > >> >> >>> >> for
> > > > >> >> >>> >> example
> > the PAT documents on disk 3 have an
> > > > >> >> >>> average doc length
> > of 3543,
> > > > >> >> >>> >> but
> > > > >> >> >>> >> the AP
> > documents on disk 1 have an avg doc length
> > > > >> >> >>> of 353.
> > > > >> >> >>> >>
> > > > >> >> >>> >> I have
> > found on other collections that any
> > > > >> >> >>> advantages of
> > BM25's document
> > > > >> >> >>> >> length
> > normalization fall apart when 'average
> > > > >> >> >>> document length'
> > doesn't
> > > > >> >> >>> >> make
> > > > >> >> >>> >> a whole
> > lot of sense (cases like this).
> > > > >> >> >>> >>
> > > > >> >> >>> >> For this
> > same reason, I've only found a few
> > > > >> >> >>> collections where
> > BM25's doc
> > > > >> >> >>> >> length
> > normalization is really significantly
> > > > >> >> >>> better than
> > Lucene's.
> > > > >> >> >>> >>
> > > > >> >> >>> >> In my
> > opinion, the results on a particular test
> > > > >> >> >>> collection or 2
> > have
> > > > >> >> >>> >> perhaps
> > > > >> >> >>> >> been
> > taken too far and created a myth that BM25 is
> > > > >> >> >>> always superior
> > to
> > > > >> >> >>> >> Lucene's
> > scoring... this is not true!
> > > > >> >> >>> >>
> > > > >> >> >>> >> On Tue,
> > Feb 16, 2010 at 9:46 AM, Ivan Provalov
> > > > >> >> >>> <ip...@yahoo.com>
> > > > >> >> >>> >> wrote:
> > > > >> >> >>> >>
> > > > >> >> >>> >>  I
> > applied the Lucene patch mentioned in
> > > > >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > > > >> >> >>> ran the MAP
> > > > >> >> >>> >>>
> > numbers
> > > > >> >> >>> >>> on
> > TREC-3 collection using topics
> > > > >> >> >>> 151-200.  I
> > am not getting worse
> > > > >> >> >>> >>>
> > results
> > > > >> >> >>> >>>
> > comparing to Lucene DefaultSimilarity.  I
> > > > >> >> >>> suspect, I am not
> > using it
> > > > >> >> >>> >>>
> > correctly.  I have single field
> > > > >> >> >>> documents.
> > This is the process I use:
> > > > >> >> >>> >>>
> > > > >> >> >>> >>> 1.
> > During the indexing, I am setting the
> > > > >> >> >>> similarity to BM25
> > as such:
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > IndexWriter writer = new IndexWriter(dir, new
> > > > >> >> >>> StandardAnalyzer(
> > > > >> >> >>> >>>
> > > > >> >> >>>
> > Version.LUCENE_CURRENT), true,
> > > > >> >> >>> >>>
> > > > >> >> >>>
> > IndexWriter.MaxFieldLength.UNLIMITED);
> > > > >> >> >>> >>>
> > writer.setSimilarity(new BM25Similarity());
> > > > >> >> >>> >>>
> > > > >> >> >>> >>> 2.
> > During the Precision/Recall measurements, I
> > > > >> >> >>> am using a
> > > > >> >> >>> >>>
> > SimpleBM25QQParser extension I added to the
> > > > >> >> >>> benchmark:
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > QualityQueryParser qqParser = new
> > > > >> >> >>>
> > SimpleBM25QQParser("title", "TEXT");
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>> >>> 3.
> > Here is the parser code (I set an avg doc
> > > > >> >> >>> length here):
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > public Query parse(QualityQuery qq) throws
> > > > >> >> >>> ParseException {
> > > > >> >> >>>
> > >>>   BM25Parameters.setAverageLength(indexField,
> > > > >> >> >>> 798.30f);//avg doc
> > length
> > > > >> >> >>>
> > >>>   BM25Parameters.setB(0.5f);//tried
> > > > >> >> >>> default values
> > > > >> >> >>>
> > >>>   BM25Parameters.setK1(2f);
> > > > >> >> >>>
> > >>>   return query = new
> > > > >> >> >>>
> > BM25BooleanQuery(qq.getValue(qqName), indexField,
> > > > >> >> >>> >>> new
> > > > >> >> >>> >>>
> > StandardAnalyzer(Version.LUCENE_CURRENT));
> > > > >> >> >>> >>> }
> > > > >> >> >>> >>>
> > > > >> >> >>> >>> 4.
> > The searcher is using BM25 similarity:
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > Searcher searcher = new IndexSearcher(dir,
> > > > >> >> >>> true);
> > > > >> >> >>> >>>
> > searcher.setSimilarity(sim);
> > > > >> >> >>> >>>
> > > > >> >> >>> >>> Am I
> > missing some steps?  Does anyone
> > > > >> >> >>> have experience
> > with this code?
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > Thanks,
> > > > >> >> >>> >>>
> > > > >> >> >>> >>> Ivan
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>>
> > > > >>
> > ---------------------------------------------------------------------
> > > > >> >> >>> >>> To
> > unsubscribe, e-mail:
> > > > >> java-user-unsubscribe@lucene.apache.org
> > > > >> >> >>> >>> For
> > additional commands, e-mail:
> > > > >> >> java-user-help@lucene.apache.org
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>> >>>
> > > > >> >> >>> >>
> > > > >> >> >>> >>
> > > > >> >> >>> > --
> > > > >> >> >>> >
> > > > >> >> >>>
> > -----------------------------------------------------------
> > > > >> >> >>> > Joaquín
> > Pérez Iglesias
> > > > >> >> >>> > Dpto.
> > Lenguajes y Sistemas Informáticos
> > > > >> >> >>> > E.T.S.I.
> > Informática (UNED)
> > > > >> >> >>> > Ciudad
> > Universitaria
> > > > >> >> >>> > C/ Juan del
> > Rosal nº 16
> > > > >> >> >>> > 28040 Madrid
> > - Spain
> > > > >> >> >>> > Phone. +34 91
> > 398 89 19
> > > > >> >> >>> > Fax
> >   +34 91 398 65 35
> > > > >> >> >>> > Office
> > 2.11
> > > > >> >> >>> > Email: joaquin.perez@lsi.uned.es
> > > > >> >> >>> > web:
> > > > >> http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>
> > > > >> >> >>> >
> > > > >> >> >>>
> > -----------------------------------------------------------
> > > > >> >> >>> >
> > > > >> >> >>> >
> > > > >> >> >>> >
> > > > >> >> >>>
> > > > >> >> >>> >
> > > > >> >> >>> >
> > > > >> >> >>>
> > > > >> >> >>>
> > > > >> >> >>> --
> > > > >> >> >>> Robert Muir
> > > > >> >> >>> rcmuir@gmail.com
> > > > >> >> >>>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >
> > > > >> >> >
> > > > >> >> >
> > > > >> >> >
> > > > >> >> >
> > > > >> >> >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Robert Muir
> > > > >> > rcmuir@gmail.com
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

RE: BM25 Scoring Patch

Posted by Yuval Feinstein <yu...@answers.com>.
-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Thursday, February 18, 2010 3:09 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

Yuval, don't we still need this 'document-level IDF' for BM25f?

- Yes, we do need 'document-level IDF' for BM25f.  
- Joaquin tried to bypass this by using the IDF of the field having the longest average length instead
- of the document's IDF.
- This introduces some bias into the scoring formula, but maybe it is not too large...
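
To make the role of these statistics concrete, here is a minimal sketch of the classic BM25 term score in plain Java. This is illustrative only, not the patch's actual classes; the names k1, b, and avgdl follow the standard formula, and it shows why the IDF and the average length must be computed over the right unit (field vs. document):

```java
// Minimal BM25 sketch (illustrative; NOT the BM25Similarity from the patch).
// k1 and b are the free parameters discussed in this thread; avgdl is the
// average document (or field) length over the collection.
public class Bm25Sketch {
    // Probabilistic IDF: log((N - df + 0.5) / (df + 0.5))
    static double idf(int numDocs, int docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Contribution of one term occurring tf times in a document of length
    // docLen. With b = 0 length normalization is disabled entirely.
    static double score(double tf, double docLen, double avgdl,
                        double k1, double b, double idf) {
        double lengthNorm = k1 * ((1 - b) + b * (docLen / avgdl));
        return idf * (tf * (k1 + 1)) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        double idf = idf(1000, 10);
        // Robert's mixed sub-collections: same tf, very different lengths
        // (353 vs 3543) against one global avgdl. With b > 0 the shorter
        // document scores higher.
        System.out.println(score(3, 353, 798.30, 2.0, 0.5, idf));
        System.out.println(score(3, 3543, 798.30, 2.0, 0.5, idf));
    }
}
```

The sketch also makes Robert's earlier point visible: when one avgdl is fed documents from sub-collections whose lengths differ by an order of magnitude, the length normalization term swings wildly.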

On Thu, Feb 18, 2010 at 3:45 AM, Yuval Feinstein <yu...@answers.com> wrote:

> We could solve this by saying we only incorporate BM25F into Lucene.
> This is a field-based scoring method, so it saves us the need to deal with
> documents.
> Building on Joaquin's work, the extra parts needed IMO are:
> a. Support for storing average length per field during indexing. I think I
> saw some reference to this
> when Grant described the new features in Lucene 2.9. We need to store two
> numbers (say
> number of documents containing the field and average length) to support
> incremental indexing.
> b. Easy integration of BM25F similarity - default parameter values, working
> with regular Lucene class hierarchy.
> c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We
> could do this incrementally,
> throwing an "UnsupportedOperationException" in the meantime).
> d. Some work on run-time efficiency, to be near the efficiency of the
> default scoring.
> I could do some of this work myself, but guidance from a Lucene scoring
> guru would be a great help.
> Thanks,
> Yuval
>

Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Yuval, don't we still need this 'document-level IDF' for BM25f?

On Thu, Feb 18, 2010 at 3:45 AM, Yuval Feinstein <yu...@answers.com> wrote:

> We could solve this by saying we only incorporate BM25F into Lucene.
> This is a field-based scoring method, so it saves us the need to deal with
> documents.
> Building on Joaquin's work, the extra parts needed IMO are:
> a. Support for storing average length per field during indexing. I think I
> saw some reference to this
> when Grant described the new features in Lucene 2.9. We need to store two
> numbers (say
> number of documents containing the field and average length) to support
> incremental indexing.
> b. Easy integration of BM25F similarity - default parameter values, working
> with regular Lucene class hierarchy.
> c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We
> could do this incrementally,
> throwing an "UnsupportedOperationException" in the meantime).
> d. Some work on run-time efficiency, to be near the efficiency of the
> default scoring.
> I could do some of this work myself, but guidance from a Lucene scoring
> guru would be a great help.
> Thanks,
> Yuval
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, February 17, 2010 6:47 PM
> To: java-user@lucene.apache.org
> Subject: Re: BM25 Scoring Patch
>
> I tend to agree with you Marvin, you are right, the different scoring
> mechanisms need different information available and this is the problem.
>
> although last I checked, one hard part of BM25 rotates around fields versus
> documents... e.g. BM25's IDF calculation.
>
> but maybe this is just an extreme form of your example :)
>
> On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey <marvin@rectangular.com
> >wrote:
>
> > On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > > yet if we don't do the hard work up front to make it easy to plug in
> > things
> > > like BM25, then no one will implement additional scoring formulas for
> > > Lucene, we currently make it terribly difficult to do this.
> >
> > FWIW... Similarity and posting format spec are so closely tied that I'm
> > considering linking them in Lucy.
> >
> >  Schema schema = new Schema();
> >  FullTextType bm25Type = new FullTextType(new BM25Similarity());
> >  schema.specField("content", bm25Type);
> >  schema.specField("title", bm25Type);
> >  StringType matchType = new StringType(new MatchSimilarity());
> >  schema.specField("category", matchType);
> >
> > That way, custom scoring implementations can guarantee that they always
> > have
> > the posting information they need available to make their similarity
> > judgments.  Similarity also becomes a more generalized notion, with the
> > TF/IDF-specific functionality moving into a subclass.
> >
> > Maybe something similar could be made to work in Lucene.  Dunno how
> > McCandless
> > has things set up for spec'ing codecs on the flex branch.
> >
> > Marvin Humphrey
> >
> >
> >
> >
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com

RE: BM25 Scoring Patch

Posted by Yuval Feinstein <yu...@answers.com>.
We could solve this by saying we only incorporate BM25F into Lucene.
This is a field-based scoring method, so it saves us the need to deal with documents.
Building on Joaquin's work, the extra parts needed IMO are:
a. Support for storing average length per field during indexing. I think I saw some reference to this
when Grant described the new features in Lucene 2.9. We need to store two numbers (say
number of documents containing the field and average length) to support incremental indexing.
b. Easy integration of BM25F similarity - default parameter values, working with regular Lucene class hierarchy.
c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We could do this incrementally,
throwing an "UnsupportedOperationException" in the meantime).
d. Some work on run-time efficiency, to be near the efficiency of the default scoring.
I could do some of this work myself, but guidance from a Lucene scoring guru would be a great help.
Thanks,
Yuval
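
The two per-field statistics in (a) could be kept incrementally with a small helper along these lines (hypothetical code, not part of the patch or of Lucene's index format):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: per field, track how many documents contain the
// field and the total token count, so the average length stays correct
// under incremental indexing (store the pair, derive the average).
public class FieldLengthStats {
    private final Map<String, long[]> stats = new HashMap<>(); // {docCount, totalLength}

    // Call once per (document, field) at index time with the field's token count.
    public void addField(String field, long length) {
        long[] s = stats.computeIfAbsent(field, f -> new long[2]);
        s[0]++;          // documents containing the field
        s[1] += length;  // total tokens across those documents
    }

    public long docCount(String field) {
        long[] s = stats.get(field);
        return s == null ? 0 : s[0];
    }

    public float averageLength(String field) {
        long[] s = stats.get(field);
        return s == null || s[0] == 0 ? 0f : (float) s[1] / s[0];
    }
}
```

Storing the (count, total) pair rather than the average itself is what makes merging segments and adding documents cheap: the pairs just add.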

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Wednesday, February 17, 2010 6:47 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

I tend to agree with you Marvin, you are right, the different scoring
mechanisms need different information available and this is the problem.

although last I checked, one hard part of BM25 rotates around fields versus
documents... e.g. BM25's IDF calculation.

but maybe this is just an extreme form of your example :)

On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey <ma...@rectangular.com>wrote:

> On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > yet if we don't do the hard work up front to make it easy to plug in
> things
> > like BM25, then no one will implement additional scoring formulas for
> > Lucene, we currently make it terribly difficult to do this.
>
> FWIW... Similarity and posting format spec are so closely tied that I'm
> considering linking them in Lucy.
>
>  Schema schema = new Schema();
>  FullTextType bm25Type = new FullTextType(new BM25Similarity());
>  schema.specField("content", bm25Type);
>  schema.specField("title", bm25Type);
>  StringType matchType = new StringType(new MatchSimilarity());
>  schema.specField("category", matchType);
>
> That way, custom scoring implementations can guarantee that they always
> have
> the posting information they need available to make their similarity
> judgments.  Similarity also becomes a more generalized notion, with the
> TF/IDF-specific functionality moving into a subclass.
>
> Maybe something similar could be made to work in Lucene.  Dunno how
> McCandless
> has things set up for spec'ing codecs on the flex branch.
>
> Marvin Humphrey
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
I tend to agree with you Marvin, you are right, the different scoring
mechanisms need different information available and this is the problem.

although last I checked, one hard part of BM25 rotates around fields versus
documents... e.g. BM25's IDF calculation.

but maybe this is just an extreme form of your example :)

On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey <ma...@rectangular.com>wrote:

> On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > yet if we don't do the hard work up front to make it easy to plug in
> things
> > like BM25, then no one will implement additional scoring formulas for
> > Lucene, we currently make it terribly difficult to do this.
>
> FWIW... Similarity and posting format spec are so closely tied that I'm
> considering linking them in Lucy.
>
>  Schema schema = new Schema();
>  FullTextType bm25Type = new FullTextType(new BM25Similarity());
>  schema.specField("content", bm25Type);
>  schema.specField("title", bm25Type);
>  StringType matchType = new StringType(new MatchSimilarity());
>  schema.specField("category", matchType);
>
> That way, custom scoring implementations can guarantee that they always
> have
> the posting information they need available to make their similarity
> judgments.  Similarity also becomes a more generalized notion, with the
> TF/IDF-specific functionality moving into a subclass.
>
> Maybe something similar could be made to work in Lucene.  Dunno how
> McCandless
> has things set up for spec'ing codecs on the flex branch.
>
> Marvin Humphrey
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> yet if we don't do the hard work up front to make it easy to plug in things
> like BM25, then no one will implement additional scoring formulas for
> Lucene, we currently make it terribly difficult to do this.

FWIW... Similarity and posting format spec are so closely tied that I'm
considering linking them in Lucy.  

  Schema schema = new Schema();
  FullTextType bm25Type = new FullTextType(new BM25Similarity());
  schema.specField("content", bm25Type);
  schema.specField("title", bm25Type);
  StringType matchType = new StringType(new MatchSimilarity());
  schema.specField("category", matchType);

That way, custom scoring implementations can guarantee that they always have
the posting information they need available to make their similarity
judgments.  Similarity also becomes a more generalized notion, with the
TF/IDF-specific functionality moving into a subclass.

Maybe something similar could be made to work in Lucene.  Dunno how McCandless
has things set up for spec'ing codecs on the flex branch.

Marvin Humphrey




Re: BM25 Scoring Patch

Posted by Ivan Provalov <ip...@yahoo.com>.
Another example of plugging in a different scoring mechanism is using the average term frequency for the TF normalization described in IBM's http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

We opened up the TermScorer class for that.
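
A rough sketch of that normalization, assuming avgTf is computed as the document's length divided by its number of unique terms (this is our reading of the paper; the actual TermScorer change may differ):

```java
// Sketch of TF normalization by a document's own average term frequency:
// a term occurring at the document's average frequency gets weight 1.0,
// so documents that repeat every term heavily are not over-rewarded.
// (Interpretation of the IBM Haifa approach, not the actual patched code.)
public class AvgTfNorm {
    // avgTf would be docLength / numUniqueTerms for the document.
    static double normalizedTf(double tf, double avgTf) {
        return (1 + Math.log(tf)) / (1 + Math.log(avgTf));
    }

    public static void main(String[] args) {
        // A term at the document's own average frequency gets weight 1.0.
        System.out.println(normalizedTf(2.0, 2.0)); // 1.0
    }
}
```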

Thanks,

Ivan

--- On Wed, 2/17/10, Robert Muir <rc...@gmail.com> wrote:

> From: Robert Muir <rc...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Wednesday, February 17, 2010, 10:31 AM
> Yuval, i apologize for not having an
> intelligent response for your question
> (if i did i would try to formulate it as a patch), but I
> too would like for
> it to be extremely easy... maybe we can iterate on the
> patch.
> 
> below is how i feel about it:
> 
> i guess theoretically, the use of Similarity is how we
> would implement a
> pluggable scoring formula, i think already supported by
> Solr. it would be
> nice if BM25 could be just another Similarity, but i'm not
> even sure thats
> realistic in the near future.
> 
> yet if we don't do the hard work up front to make it easy
> to plug in things
> like BM25, then no one will implement additional scoring
> formulas for
> Lucene, we currently make it terribly difficult to do
> this.
> 
> in the BM25 case we are just lucky, as Joaquin went thru a
> lot of
> work/jumped thru a lot of hoops to make it happen.
> 
> On Wed, Feb 17, 2010 at 3:36 AM, Yuval Feinstein <yu...@answers.com>
> wrote:
> 
> > This is very interesting and much friendlier than a
> flame war.
> > My practical question for Robert is:
> > How can we modify the BM25 patch so that it:
> > a) Becomes part of Lucene contrib.
> > b) Be easier to use (preventing mistakes  such as
> Ivan's using the BM25
> > similarity during indexing).
> > c) Proceeds towards a pluggable scoring formula
> (Ideally, we should have an
> > IndexReader/IndexSearcher/IndexWriter
> > constructor enabling specifying a scoring model
> through an enum, with the
> > default being, well, Lucene's default scoring model)?
> > The easier it is to use, the more experiments people
> can make, and see how
> > it works for them.
> > A future "marketing" step could be adding BM25 to
> Solr, to further ease
> > experimentation.
> > TIA,
> > Yuval
> >
> >
> > -----Original Message-----
> > From: Robert Muir [mailto:rcmuir@gmail.com]
> > Sent: Tuesday, February 16, 2010 10:38 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: BM25 Scoring Patch
> >
> > Joaquin, I have a typical methodology where I don't
> optimize any scoring
> > params: be it BM25 params (I stick with your
> defaults), or lnb.ltc params
> > (i
> > stick with default slope). When doing query expansion
> i don't modify the
> > defaults for MoreLikeThis either.
> >
> > I've found that changing these params can have a
> significant difference in
> > retrieval performance, which is interesting, but I'm
> typically focused on
> > text analysis (how is the text
> indexed?/stemming/stopwords). I also feel
> > that such things are corpus-specific, which i
> generally try to avoid in my
> > work...
> >
> > for example, in analysis work,  the text
> collection often has a majority of
> > text in a specific tense (i.e. news), so i don't at
> all try to tune any
> > part
> > of analysis as I worry this would be
> corpus-specific... I do the same with
> > scoring.
> >
> > As far as why some models perform better than others
> for certain languages,
> > I think this is a million-dollar question. But my
> intuition (I don't have
> > references or anything to back this up), is that
> probabilistic models
> > outperform vector-space models when you are using
> approaches like n-grams:
> > you don't have nice stopwords lists, stemming,
> decompounding etc.
> >
> > This is particularly interesting to me, as
> probabilistic model + ngram is a
> > very general multilingual approach that I would like
> to have working well
> > in
> > Lucene, its also important as a "default" when we
> don't have a nicely tuned
> > analyzer available that will work well with a vector
> space model. In my
> > opinion, vector-space tends to fall apart without good
> language support.
> >
> >
> > On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ
> IGLESIAS <
> > joaquin.perez@lsi.uned.es>
> wrote:
> >
> > > Ok,
> > >
> > > I'm not advocating the BM25 patch neither,
> unfortunately BM25 was not my
> > > idea :-))), and I'm sure that the implementation
> can be improved.
> > >
> > > When you use the BM25 implementation, are you
> optimising the parameters
> > > specifically per collection? (It is a key factor
> for improving BM25
> > > performance).
> > >
> > > Why do you think that BM25 works better for
> English than in other
> > > languages (apart of experiments). What are your
> intuitions?
> > >
> > > I dont't have too much experience on languages
> moreover of Spanish and
> > > English, and it sounds pretty interesting.
> > >
> > > Kind Regards.
> > >
> > > P.S: Maybe this is not a topic for this list???
> > >
> > >
> > > > Joaquin, I don't see this as a flame war?
> First of all I'd like to
> > > > personally thank you for your excellent BM25
> implementation!
> > > >
> > > > I think the selection of a retrieval model
> depends highly on the
> > > > language/indexing approach, i.e. if we were
> talking East Asian
> > languages
> > > I
> > > > think we want a probabilistic model: no
> argument there!
> > > >
> > > > All i said was that it is a myth that BM25
> is "always" better than
> > > > Lucene's
> > > > scoring model, it really depends on what you
> are trying to do, how you
> > > are
> > > > indexing your text, properties of your
> corpus, how your queries are
> > > > running.
> > > >
> > > > I don't even want to come across as
> advocating the lnb.ltc approach
> > > > either,
> > > > sure I wrote the patch, but this means
> nothing. I only like it as its
> > > > currently a simple integration into Lucene,
> but long-term its best if
> > we

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Yuval, I apologize for not having an intelligent response to your question
(if I did I would try to formulate it as a patch), but I too would like for
it to be extremely easy... maybe we can iterate on the patch.

Below is how I feel about it:

I guess theoretically, the use of Similarity is how we would implement a
pluggable scoring formula, which I think Solr already supports. It would be
nice if BM25 could be just another Similarity, but I'm not even sure that's
realistic in the near future.

Yet if we don't do the hard work up front to make it easy to plug in things
like BM25, then no one will implement additional scoring formulas for
Lucene; we currently make it terribly difficult to do this.

In the BM25 case we are just lucky, as Joaquin went through a lot of work
and jumped through a lot of hoops to make it happen.
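
For anyone comparing runs, it may help to see what such a scoring formula
actually computes. Below is a minimal, self-contained sketch of the standard
BM25 term weight; it is not the LUCENE-2091 code itself, and the class name
and the k1=2, b=0.5, avgdl=798.30 values are just taken from Ivan's example
earlier in the thread:

```java
// Sketch of the standard BM25 term weight (Robertson et al.), not the
// actual LUCENE-2091 implementation.
public class Bm25Sketch {

    // Classic probabilistic idf: log((N - df + 0.5) / (df + 0.5))
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Saturating tf with document-length normalization; k1 and b are the
    // tunable parameters discussed in this thread.
    static double tfNorm(double tf, double docLen, double avgDocLen,
                         double k1, double b) {
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        // When docLen == avgDocLen the length factor cancels:
        // tf * (k1 + 1) / (tf + k1) = 2 * 3 / 4 = 1.5
        System.out.println(tfNorm(2, 798.30, 798.30, 2.0, 0.5));
        // A document ten times longer than average is penalized:
        System.out.println(tfNorm(2, 7983.0, 798.30, 2.0, 0.5));
    }
}
```

The avgDocLen term is also why "average document length" matters so much in
this thread: on a corpus mixing sub-collections with average lengths of 353
and 3543, a single collection-wide average makes the normalization fairly
arbitrary.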

On Wed, Feb 17, 2010 at 3:36 AM, Yuval Feinstein <yu...@answers.com> wrote:

> This is very interesting and much friendlier than a flame war.
> My practical question for Robert is:
> How can we modify the BM25 patch so that it:
> a) Becomes part of Lucene contrib.
> b) Becomes easier to use (preventing mistakes such as Ivan's setting the BM25
> similarity during indexing).
> c) Proceeds towards a pluggable scoring formula (Ideally, we should have an
> IndexReader/IndexSearcher/IndexWriter
> constructor enabling specifying a scoring model through an enum, with the
> default being, well, Lucene's default scoring model)?
> The easier it is to use, the more experiments people can make, and see how
> it works for them.
> A future "marketing" step could be adding BM25 to Solr, to further ease
> experimentation.
> TIA,
> Yuval
>
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Tuesday, February 16, 2010 10:38 PM
> To: java-user@lucene.apache.org
> Subject: Re: BM25 Scoring Patch
>
> Joaquin, I have a typical methodology where I don't optimize any scoring
> params: be it BM25 params (I stick with your defaults) or lnb.ltc params (I
> stick with the default slope). When doing query expansion I don't modify the
> defaults for MoreLikeThis either.
>
> I've found that changing these params can have a significant difference in
> retrieval performance, which is interesting, but I'm typically focused on
> text analysis (how is the text indexed?/stemming/stopwords). I also feel
> that such things are corpus-specific, which i generally try to avoid in my
> work...
>
> for example, in analysis work,  the text collection often has a majority of
> text in a specific tense (i.e. news), so i don't at all try to tune any
> part
> of analysis as I worry this would be corpus-specific... I do the same with
> scoring.
>
> As far as why some models perform better than others for certain languages,
> I think this is a million-dollar question. But my intuition (I don't have
> references or anything to back this up), is that probabilistic models
> outperform vector-space models when you are using approaches like n-grams:
> you don't have nice stopwords lists, stemming, decompounding etc.
>
> This is particularly interesting to me, as probabilistic model + ngram is a
> very general multilingual approach that I would like to have working well
> in
> Lucene, its also important as a "default" when we don't have a nicely tuned
> analyzer available that will work well with a vector space model. In my
> opinion, vector-space tends to fall apart without good language support.
>
>
> On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ IGLESIAS <
> joaquin.perez@lsi.uned.es> wrote:
>
> > Ok,
> >
> > I'm not advocating the BM25 patch either; unfortunately BM25 was not my
> > idea :-))), and I'm sure that the implementation can be improved.
> >
> > When you use the BM25 implementation, are you optimising the parameters
> > specifically per collection? (It is a key factor for improving BM25
> > performance).
> >
> > Why do you think that BM25 works better for English than for other
> > languages (apart from experiments)? What are your intuitions?
> >
> > I don't have much experience with languages beyond Spanish and
> > English, and it sounds pretty interesting.
> >
> > Kind Regards.
> >
> > P.S: Maybe this is not a topic for this list???
> >
> >
> > > Joaquin, I don't see this as a flame war? First of all I'd like to
> > > personally thank you for your excellent BM25 implementation!
> > >
> > > I think the selection of a retrieval model depends highly on the
> > > language/indexing approach, i.e. if we were talking East Asian
> languages
> > I
> > > think we want a probabilistic model: no argument there!
> > >
> > > All i said was that it is a myth that BM25 is "always" better than
> > > Lucene's
> > > scoring model, it really depends on what you are trying to do, how you
> > are
> > > indexing your text, properties of your corpus, how your queries are
> > > running.
> > >
> > > I don't even want to come across as advocating the lnb.ltc approach
> > > either,
> > > sure I wrote the patch, but this means nothing. I only like it as its
> > > currently a simple integration into Lucene, but long-term its best if
> we
> > > can
> > > support other models also!
> > >
> > > Finally I think there is something to be said for Lucene's default
> > > retrieval
> > > model, which in my (non-english) findings across the board isn't
> terrible
> > > at
> > > all... then again I am working with languages where analysis is really
> > the
> > > thing holding Lucene back, not scoring.
> > >
> > > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> > > joaquin.perez@lsi.uned.es> wrote:
> > >
> > >> Just some final comments (as I said I'm not interested in flame wars),
> > >>
> > >> If I obtain better results there is no problem with pooling; otherwise
> > >> it is biased.
> > >> The only important thing (in my opinion) is that it cannot be said that
> > >> BM25 is a myth.
> > >> Yes, you are right that there is no single ranking model that beats the
> > >> rest,
> > >> but there are models that generally show a better performance in more
> > >> cases.
> > >>
> > >> About CLEF I have had the same experience (VSM vs BM25) on Spanish and
> > >> English (WebCLEF) and Q&A (ResPubliQA)
> > >>
> > >> Ivan, check the parameters (b and k1); you can probably improve your
> > >> results. (That's the bad part of BM25.)
> > >>
> > >> Finally we are just speaking from personal experience, so obviously you
> > >> should use the best model for your data and your own experience; in IR
> > >> there are neither myths nor best ranking models. If any of us is able to
> > >> find the “best” ranking model, or is able to prove that any
> > >> state-of-the-art model is a myth, he should send these results to the
> > >> SIGIR conference.
> > >>
> > >> Ivan, Robert, good luck with your experiments; as I said, the good part
> > >> of IR is that you can always run experiments on your own.
> > >>
> > >> > I don't think its really a competition, I think preferably we should
> > >> have
> > >> > the flexibility to change the scoring model in lucene actually?
> > >> >
> > >> > I have found lots of cases where VSM improves on BM25, but then
> again
> > >> I
> > >> > don't work with TREC stuff, as I work with non-english collections.
> > >> >
> > >> > It doesn't contradict years of research to say that VSM isn't a
> > >> > state-of-the-art model, besides the TREC-4 results, there are CLEF
> > >> results
> > >> > where VSM models perform competitively or exceed (Finnish, Russian,
> > >> etc)
> > >> > BM25/DFR/etc.
> > >> >
> > >> > It depends on the collection, there isn't a 'best retrieval
> formula'.
> > >> >
> > >> > Note: I have no bias against BM-25, but its definitely a myth to say
> > >> there
> > >> > is a single retrieval formula that is the 'best' across the board.
> > >> >
> > >> >
> > >> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> > >> > joaquin.perez@lsi.uned.es> wrote:
> > >> >
> > >> >> By the way,
> > >> >>
> > >> >> I don't want to start a flame war VSM vs BM25, but I really believe
> > >> >> that I have to express my opinion as Robert has done. In my experience,
> > >> >> I have never found a case where VSM significantly improves on BM25.
> > >> >> Maybe you can find some cases under very specific collection
> > >> >> characteristics (such as an average length of 300 vs 3000) or with bad
> > >> >> usage of BM25 (improper parameters) where it can happen.
> > >> >>
> > >> >> BM25 is not just a different way of length normalization; it is
> > >> >> based strongly on the probabilistic framework, and parametrises
> > >> >> frequencies and length. This is probably the most successful ranking
> > >> >> model of recent years in Information Retrieval.
> > >> >>
> > >> >> I have never read a paper where VSM improves on any of the
> > >> >> state-of-the-art ranking models (Language Models, DFR, BM25, ...),
> > >> >> although VSM with pivoted length normalisation can obtain nice results.
> > >> >> This can be verified by checking the last years of the TREC competition.
> > >> >>
> > >> >> Honestly, to say that it is a myth that BM25 improves on VSM breaks
> > >> >> the last 10 or 15 years of research on Information Retrieval, and I
> > >> >> really believe that is not accurate.
> > >> >>
> > >> >> The good thing about Information Retrieval is that you can always
> > >> >> make your own experiments and use the experience of many years of
> > >> >> research.
> > >> >>
> > >> >> PS: This opinion is based on experiments on TREC and CLEF collections;
> > >> >> obviously we can start a debate about the suitability of this type of
> > >> >> experimentation (the concept of relevance, pooling, relevance
> > >> >> judgements), but that is a much more complex topic and I believe it is
> > >> >> far from what we are dealing with here.
> > >> >>
> > >> >> PS2: In relation to TREC-4, Cornell used pivoted length normalisation
> > >> >> and they were applying pseudo-relevance feedback, which honestly makes
> > >> >> the analysis of the results much more difficult. Obviously their
> > >> >> results were part of the pool.
> > >> >>
> > >> >> Sorry for the huge mail :-))))
> > >> >>
> > >> >> > Hi Ivan,
> > >> >> >
> > >> >> > the problem is that unfortunately BM25
> > >> >> > cannot be implemented by overriding
> > >> >> > the Similarity interface. Therefore BM25Similarity
> > >> >> > only computes the classic probabilistic IDF (which is
> > >> >> > relevant only at search time).
> > >> >> > If you set BM25Similarity at indexing time
> > >> >> > some basic stats are not stored
> > >> >> > correctly in the segments (like docs length).
> > >> >> >
> > >> >> > When you use BM25BooleanQuery this class
> > >> >> > will set automatically the BM25Similarity for you,
> > >> >> > therefore you don't need to do this explicitly.
> > >> >> >
> > >> >> > I tried to make this implementation with the focus on
> > >> >> > not interfering on the typical use of Lucene (so no changing
> > >> >> > DefaultSimilarity).
> > >> >> >
> > >> >> >> Joaquin, Robert,
> > >> >> >>
> > >> >> >> I followed Joaquin's recommendation and removed the call to set
> > >> >> >> similarity
> > >> >> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
> > >> >> >> improvement for the MAP score (0.141->0.219) over default
> > >> similarity.
> > >> >> >>
> > >> >> >> Joaquin, how would setting the similarity to BM25 explicitly
> make
> > >> the
> > >> >> >> score worse?
> > >> >> >>
> > >> >> >> Thank you,
> > >> >> >>
> > >> >> >> Ivan
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> From: Robert Muir <rc...@gmail.com>
> > >> >> >>> Subject: Re: BM25 Scoring Patch
> > >> >> >>> To: java-user@lucene.apache.org
> > >> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> > >> >> >>> yes Ivan, if possible please report
> > >> >> >>> back any findings you can on the
> > >> >> >>> experiments you are doing!
> > >> >> >>>
> > >> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> > >> >> >>> <
> > >> >> >>> joaquin.perez@lsi.uned.es>
> > >> >> >>> wrote:
> > >> >> >>>
> > >> >> >>> > Hi Ivan,
> > >> >> >>> >
> > >> >> >>> > You shouldn't set the BM25Similarity for indexing or
> > >> >> >>> searching.
> > >> >> >>> > Please try removing the lines:
> > >> >> >>> >   writer.setSimilarity(new
> > >> >> >>> BM25Similarity());
> > >> >> >>> >   searcher.setSimilarity(sim);
> > >> >> >>> >
> > >> >> >>> > Please let us/me know if you improve your results with
> > >> >> >>> these changes.
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> > Robert Muir wrote:
> > >> >> >>> >
> > >> >> >>> >  Hi Ivan, I've seen many cases where BM25
> > >> >> >>> performs worse than Lucene's
> > >> >> >>> >> default Similarity. Perhaps this is just another
> > >> >> >>> one?
> > >> >> >>> >>
> > >> >> >>> >> Again while I have not worked with this particular
> > >> >> >>> collection, I looked at
> > >> >> >>> >> the statistics and noted that its composed of
> > >> >> >>> several 'sub-collections':
> > >> >> >>> >> for
> > >> >> >>> >> example the PAT documents on disk 3 have an
> > >> >> >>> average doc length of 3543,
> > >> >> >>> >> but
> > >> >> >>> >> the AP documents on disk 1 have an avg doc length
> > >> >> >>> of 353.
> > >> >> >>> >>
> > >> >> >>> >> I have found on other collections that any
> > >> >> >>> advantages of BM25's document
> > >> >> >>> >> length normalization fall apart when 'average
> > >> >> >>> document length' doesn't
> > >> >> >>> >> make
> > >> >> >>> >> a whole lot of sense (cases like this).
> > >> >> >>> >>
> > >> >> >>> >> For this same reason, I've only found a few
> > >> >> >>> collections where BM25's doc
> > >> >> >>> >> length normalization is really significantly
> > >> >> >>> better than Lucene's.
> > >> >> >>> >>
> > >> >> >>> >> In my opinion, the results on a particular test
> > >> >> >>> collection or 2 have
> > >> >> >>> >> perhaps
> > >> >> >>> >> been taken too far and created a myth that BM25 is
> > >> >> >>> always superior to
> > >> >> >>> >> Lucene's scoring... this is not true!
> > >> >> >>> >>
> > >> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> > >> >> >>> <ip...@yahoo.com>
> > >> >> >>> >> wrote:
> > >> >> >>> >>
> > >> >> >>> >>  I applied the Lucene patch mentioned in
> > >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> > >> >> >>> ran the MAP
> > >> >> >>> >>> numbers
> > >> >> >>> >>> on TREC-3 collection using topics
> > >> >> >>> 151-200.  I am not getting worse
> > >> >> >>> >>> results
> > >> >> >>> >>> comparing to Lucene DefaultSimilarity.  I
> > >> >> >>> suspect, I am not using it
> > >> >> >>> >>> correctly.  I have single field
> > >> >> >>> documents.  This is the process I use:
> > >> >> >>> >>>
> > >> >> >>> >>> 1. During the indexing, I am setting the
> > >> >> >>> similarity to BM25 as such:
> > >> >> >>> >>>
> > >> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new
> > >> >> >>> StandardAnalyzer(
> > >> >> >>> >>>
> > >> >> >>>    Version.LUCENE_CURRENT), true,
> > >> >> >>> >>>
> > >> >> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
> > >> >> >>> >>> writer.setSimilarity(new BM25Similarity());
> > >> >> >>> >>>
> > >> >> >>> >>> 2. During the Precision/Recall measurements, I
> > >> >> >>> am using a
> > >> >> >>> >>> SimpleBM25QQParser extension I added to the
> > >> >> >>> benchmark:
> > >> >> >>> >>>
> > >> >> >>> >>> QualityQueryParser qqParser = new
> > >> >> >>> SimpleBM25QQParser("title", "TEXT");
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>> >>> 3. Here is the parser code (I set an avg doc
> > >> >> >>> length here):
> > >> >> >>> >>>
> > >> >> >>> >>> public Query parse(QualityQuery qq) throws
> > >> >> >>> ParseException {
> > >> >> >>> >>>   BM25Parameters.setAverageLength(indexField,
> > >> >> >>> 798.30f);//avg doc length
> > >> >> >>> >>>   BM25Parameters.setB(0.5f);//tried
> > >> >> >>> default values
> > >> >> >>> >>>   BM25Parameters.setK1(2f);
> > >> >> >>> >>>   return query = new
> > >> >> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
> > >> >> >>> >>> new
> > >> >> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> > >> >> >>> >>> }
> > >> >> >>> >>>
> > >> >> >>> >>> 4. The searcher is using BM25 similarity:
> > >> >> >>> >>>
> > >> >> >>> >>> Searcher searcher = new IndexSearcher(dir,
> > >> >> >>> true);
> > >> >> >>> >>> searcher.setSimilarity(sim);
> > >> >> >>> >>>
> > >> >> >>> >>> Am I missing some steps?  Does anyone
> > >> >> >>> have experience with this code?
> > >> >> >>> >>>
> > >> >> >>> >>> Thanks,
> > >> >> >>> >>>
> > >> >> >>> >>> Ivan
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>>
> > >> ---------------------------------------------------------------------
> > >> >> >>> >>> To unsubscribe, e-mail:
> > >> java-user-unsubscribe@lucene.apache.org
> > >> >> >>> >>> For additional commands, e-mail:
> > >> >> java-user-help@lucene.apache.org
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>> >>>
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> > --
> > >> >> >>> >
> > >> >> >>> -----------------------------------------------------------
> > >> >> >>> > Joaquín Pérez Iglesias
> > >> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> > >> >> >>> > E.T.S.I. Informática (UNED)
> > >> >> >>> > Ciudad Universitaria
> > >> >> >>> > C/ Juan del Rosal nº 16
> > >> >> >>> > 28040 Madrid - Spain
> > >> >> >>> > Phone. +34 91 398 89 19
> > >> >> >>> > Fax    +34 91 398 65 35
> > >> >> >>> > Office  2.11
> > >> >> >>> > Email: joaquin.perez@lsi.uned.es
> > >> >> >>> > web:
> > >> http://nlp.uned.es/~jperezi/
> > >> >> >>> >
> > >> >> >>> -----------------------------------------------------------
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> --
> > >> >> >>> Robert Muir
> > >> >> >>> rcmuir@gmail.com
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >
> > >> >
> > >> > --
> > >> > Robert Muir
> > >> > rcmuir@gmail.com
> > >> >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
> >
> >
> >
> >
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com
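
For reference, the MAP figures traded in this thread (e.g. 0.141 vs. 0.219) are mean average precision as reported over the TREC topics; below is a minimal, self-contained sketch of the per-topic average-precision computation (illustrative only, not the benchmark's actual code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AveragePrecision {
    /**
     * Average precision for a single topic: the mean of precision@k taken
     * at each rank k where a relevant document appears. MAP is this value
     * averaged over all topics (151-200 in Ivan's runs).
     */
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        double sum = 0.0;
        int hits = 0;
        for (int k = 0; k < ranked.size(); k++) {
            if (relevant.contains(ranked.get(k))) {
                hits++;
                sum += (double) hits / (k + 1); // precision at rank k+1
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("d1", "d2", "d3");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3"));
        System.out.println(averagePrecision(ranked, relevant)); // ≈ 0.8333
    }
}
```

In practice trec_eval (or the benchmark's quality package) computes this; the sketch only shows what the single number being compared actually measures.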

RE: BM25 Scoring Patch

Posted by Yuval Feinstein <yu...@answers.com>.
This is very interesting and much friendlier than a flame war.
My practical question for Robert is:
How can we modify the BM25 patch so that it:
a) Becomes part of Lucene contrib.
b) Is easier to use (preventing mistakes such as Ivan's setting the BM25 similarity during indexing).
c) Moves towards a pluggable scoring formula? (Ideally, we should have an IndexReader/IndexSearcher/IndexWriter
constructor that lets you specify a scoring model through an enum, with the default being, well, Lucene's default scoring model.)
The easier it is to use, the more experiments people can run to see how it works for them.
A future "marketing" step could be adding BM25 to Solr, to further ease experimentation.
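Sketching suggestion (c): no such enum-based constructor exists in Lucene, so everything below is hypothetical, with a stub `Similarity` interface standing in for Lucene's class so the example is self-contained:

```java
public class ScoringModelSketch {
    // Stub standing in for org.apache.lucene.search.Similarity.
    interface Similarity { String name(); }

    // Hypothetical enum: callers pick a model and Lucene wires in the
    // matching Similarity (and, where needed, the matching query classes).
    enum ScoringModel { DEFAULT, BM25 }

    static Similarity similarityFor(ScoringModel model) {
        switch (model) {
            case BM25:
                return () -> "BM25Similarity";    // the LUCENE-2091 implementation
            case DEFAULT:
            default:
                return () -> "DefaultSimilarity"; // Lucene's default scoring
        }
    }

    public static void main(String[] args) {
        // e.g. new IndexSearcher(dir, ScoringModel.BM25) in the imagined API
        System.out.println(similarityFor(ScoringModel.BM25).name());
    }
}
```

The real work would be hiding BM25's extra statistics (such as the average field length) behind this switch, which is exactly what trips users up today.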
TIA,
Yuval


-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Tuesday, February 16, 2010 10:38 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch


Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Joaquin, I have a typical methodology where I don't optimize any scoring
params: be it BM25 params (I stick with your defaults) or lnb.ltc params (I
stick with the default slope). When doing query expansion I don't modify the
defaults for MoreLikeThis either.

I've found that changing these params can make a significant difference in
retrieval performance, which is interesting, but I'm typically focused on
text analysis (how the text is indexed: stemming, stopwords). I also feel
that such things are corpus-specific, which I generally try to avoid in my
work...

For example, in analysis work, the text collection often has a majority of
text in a specific tense (i.e. news), so I don't try to tune any part of
analysis at all, as I worry this would be corpus-specific... I do the same with
scoring.

As far as why some models perform better than others for certain languages,
I think this is a million-dollar question. But my intuition (I don't have
references or anything to back this up) is that probabilistic models
outperform vector-space models when you are using approaches like n-grams:
you don't have nice stopword lists, stemming, decompounding, etc.

This is particularly interesting to me, as probabilistic model + n-grams is a
very general multilingual approach that I would like to have working well in
Lucene. It's also important as a "default" when we don't have a nicely tuned
analyzer available that will work well with a vector space model. In my
opinion, vector-space tends to fall apart without good language support.
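
(The length-normalization behavior discussed in this thread is visible directly in the standard BM25 term-frequency factor; the sketch below is the textbook formula with the k1 and b parameters from BM25Parameters, not the patch's actual code:

```java
public class Bm25Weight {
    /**
     * Textbook BM25 tf factor (idf omitted): tf saturates under k1, and b
     * controls how strongly the score is normalized by document length
     * relative to the collection average.
     */
    public static double tfFactor(double tf, double docLen, double avgDocLen,
                                  double k1, double b) {
        double lengthNorm = (1 - b) + b * (docLen / avgDocLen);
        return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
}
```

With mixed sub-collections such as the average lengths of 353 vs. 3543 mentioned earlier, no single avgDocLen fits both, so the b term mis-normalizes one half of the corpus or the other.)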


On Tue, Feb 16, 2010 at 3:23 PM, JOAQUIN PEREZ IGLESIAS <
joaquin.perez@lsi.uned.es> wrote:

> Ok,
>
> I'm not advocating the BM25 patch neither, unfortunately BM25 was not my
> idea :-))), and I'm sure that the implementation can be improved.
>
> When you use the BM25 implementation, are you optimising the parameters
> specifically per collection? (It is a key factor for improving BM25
> performance).
>
> Why do you think that BM25 works better for English than in other
> languages (apart of experiments). What are your intuitions?
>
> I don't have much experience with languages other than Spanish and
> English, and it sounds pretty interesting.
>
> Kind Regards.
>
> P.S: Maybe this is not a topic for this list???
>
>
> > Joaquin, I don't see this as a flame war? First of all I'd like to
> > personally thank you for your excellent BM25 implementation!
> >
> > I think the selection of a retrieval model depends highly on the
> > language/indexing approach, i.e. if we were talking East Asian languages
> I
> > think we want a probabilistic model: no argument there!
> >
> > All i said was that it is a myth that BM25 is "always" better than
> > Lucene's
> > scoring model, it really depends on what you are trying to do, how you
> are
> > indexing your text, properties of your corpus, how your queries are
> > running.
> >
> > I don't even want to come across as advocating the lnb.ltc approach
> > either,
> > sure I wrote the patch, but this means nothing. I only like it as its
> > currently a simple integration into Lucene, but long-term its best if we
> > can
> > support other models also!
> >
> > Finally I think there is something to be said for Lucene's default
> > retrieval
> > model, which in my (non-english) findings across the board isn't terrible
> > at
> > all... then again I am working with languages where analysis is really
> the
> > thing holding Lucene back, not scoring.
> >
> > On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> > joaquin.perez@lsi.uned.es> wrote:
> >
> >> Just some final comments (as I said I'm not interested in flame wars),
> >>
> >> If I obtain better results there are not problem with pooling otherwise
> >> it
> >> is biased.
> >> The only important thing (in my opinion) is that it cannot be said that
> >> BM25 is a myth.
> >> Yes, you are right there is not an only ranking model that beats the
> >> rest,
> >> but there are models that generally show a better performance in more
> >> cases.
> >>
> >> About CLEF I have had the same experience (VSM vs BM25) on Spanish and
> >> English (WebCLEF) and Q&A (ResPubliQA)
> >>
> >> Ivan checks the parameters (b and k1), probably you can improve your
> >> results. (that's the bad part of BM25).
> >>
> >> Finally we are just speaking of personal experience, so obviously you
> >> should use the best model for your data and your own experience, on IR
> >> there are not myths neither best ranking models. If any of us is able to
> >> find the “best” ranking model, or is able to prove that any
> >> state-of-the art is a myth he should send these results to the SIGIR
> >> conference.
> >>
> >> Ivan, Robert good luck with your experiments, as I said the good part of
> >> IR is that you can always make experiments on your own.
> >>
> >> > I don't think its really a competition, I think preferably we should
> >> have
> >> > the flexibility to change the scoring model in lucene actually?
> >> >
> >> > I have found lots of cases where VSM improves on BM25, but then again
> >> I
> >> > don't work with TREC stuff, as I work with non-english collections.
> >> >
> >> > It doesn't contradict years of research to say that VSM isn't a
> >> > state-of-the-art model, besides the TREC-4 results, there are CLEF
> >> results
> >> > where VSM models perform competitively or exceed (Finnish, Russian,
> >> etc)
> >> > BM25/DFR/etc.
> >> >
> >> > It depends on the collection, there isn't a 'best retrieval formula'.
> >> >
> >> > Note: I have no bias against BM-25, but its definitely a myth to say
> >> there
> >> > is a single retrieval formula that is the 'best' across the board.
> >> >
> >> >
> >> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> >> > joaquin.perez@lsi.uned.es> wrote:
> >> >
> >> >> By the way,
> >> >>
> >> >> I don't want to start a flame war VSM vs BM25, but I really believe
> >> that
> >> >> I
> >> >> have to express my opinion as Robert has done. In my experience, I
> >> have
> >> >> never found a case where VSM improves significantly BM25. Maybe you
> >> can
> >> >> find some cases under some very specific collection characteristics,
> >> (as
> >> >> average length of 300 vs 3000) or a bad usage of BM25 (not proper
> >> >> parameters) where it can happen.
> >> >>
> >> >> BM25 is not just only a different way of length normalization, it is
> >> >> based
> >> >> strongly in the probabilistic framework, and parametrises frequencies
> >> >> and
> >> >> length. This is probably the most successful ranking model of the
> >> last
> >> >> years in Information Retrieval.
> >> >>
> >> >> I have never read a paper where VSM  improves any of the
> >> >> state-of-the-art
> >> >> ranking models (Language Models, DFR, BM25,...),  although the VSM
> >> with
> >> >> pivoted normalisation length can obtain nice results. This can be
> >> proved
> >> >> checking the last years of the TREC competition.
> >> >>
> >> >> Honestly to say that is a myth that BM25 improves VSM breaks the last
> >> 10
> >> >> or 15 years of research on Information Retrieval, and I really
> >> believe
> >> >> that is not accurate.
> >> >>
> >> >> The good thing of Information Retrieval is that you can always make
> >> your
> >> >> owns experiments and you can use the experience of a lot of years of
> >> >> research.
> >> >>
> >> >> PS: This opinion is based on experiments on TREC and CLEF
> >> collections,
> >> >> obviously we can start a debate about the suitability of this type of
> >> >> experimentation (concept of relevance, pooling, relevance
> >> judgements),
> >> >> but
> >> >> this is a much more complex topic and I believe is far from what we
> >> are
> >> >> dealing here.
> >> >>
> >> >> PS2: In relation with TREC4 Cornell used a pivoted length
> >> normalisation
> >> >> and they were applying pseudo-relevance feedback, what honestly makes
> >> >> much
> >> >> more difficult the analysis of the results. Obviously their results
> >> were
> >> >> part of the pool.
> >> >>
> >> >> Sorry for the huge mail :-))))
> >> >>
> >> >> > Hi Ivan,
> >> >> >
> >> >> > the problem is that unfortunately BM25
> >> >> > cannot be implemented overwriting
> >> >> > the Similarity interface. Therefore BM25Similarity
> >> >> > only computes the classic probabilistic IDF (what is
> >> >> > interesting only at search time).
> >> >> > If you set BM25Similarity at indexing time
> >> >> > some basic stats are not stored
> >> >> > correctly in the segments (like docs length).
> >> >> >
> >> >> > When you use BM25BooleanQuery this class
> >> >> > will set automatically the BM25Similarity for you,
> >> >> > therefore you don't need to do this explicitly.
> >> >> >
> >> >> > I tried to make this implementation with the focus on
> >> >> > not interfering on the typical use of Lucene (so no changing
> >> >> > DefaultSimilarity).
> >> >> >
> >> >> >> Joaquin, Robert,
> >> >> >>
> >> >> >> I followed Joaquin's recommendation and removed the call to set
> >> >> >> similarity
> >> >> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
> >> >> >> improvement for the MAP score (0.141->0.219) over default
> >> similarity.
> >> >> >>
> >> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
> >> the
> >> >> >> score worse?
> >> >> >>
> >> >> >> Thank you,
> >> >> >>
> >> >> >> Ivan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
> >> >> >>
> >> >> >>> From: Robert Muir <rc...@gmail.com>
> >> >> >>> Subject: Re: BM25 Scoring Patch
> >> >> >>> To: java-user@lucene.apache.org
> >> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >> >> >>> yes Ivan, if possible please report
> >> >> >>> back any findings you can on the
> >> >> >>> experiments you are doing!
> >> >> >>>
> >> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> >> >> >>> <
> >> >> >>> joaquin.perez@lsi.uned.es>
> >> >> >>> wrote:
> >> >> >>>
> >> >> >>> > Hi Ivan,
> >> >> >>> >
> >> >> >>> > You shouldn't set the BM25Similarity for indexing or
> >> >> >>> searching.
> >> >> >>> > Please try removing the lines:
> >> >> >>> >   writer.setSimilarity(new
> >> >> >>> BM25Similarity());
> >> >> >>> >   searcher.setSimilarity(sim);
> >> >> >>> >
> >> >> >>> > Please let us/me know if you improve your results with
> >> >> >>> these changes.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > Robert Muir wrote:
> >> >> >>> >
> >> >> >>> >  Hi Ivan, I've seen many cases where BM25
> >> >> >>> performs worse than Lucene's
> >> >> >>> >> default Similarity. Perhaps this is just another
> >> >> >>> one?
> >> >> >>> >>
> >> >> >>> >> Again while I have not worked with this particular
> >> >> >>> collection, I looked at
> >> >> >>> >> the statistics and noted that its composed of
> >> >> >>> several 'sub-collections':
> >> >> >>> >> for
> >> >> >>> >> example the PAT documents on disk 3 have an
> >> >> >>> average doc length of 3543,
> >> >> >>> >> but
> >> >> >>> >> the AP documents on disk 1 have an avg doc length
> >> >> >>> of 353.
> >> >> >>> >>
> >> >> >>> >> I have found on other collections that any
> >> >> >>> advantages of BM25's document
> >> >> >>> >> length normalization fall apart when 'average
> >> >> >>> document length' doesn't
> >> >> >>> >> make
> >> >> >>> >> a whole lot of sense (cases like this).
> >> >> >>> >>
> >> >> >>> >> For this same reason, I've only found a few
> >> >> >>> collections where BM25's doc
> >> >> >>> >> length normalization is really significantly
> >> >> >>> better than Lucene's.
> >> >> >>> >>
> >> >> >>> >> In my opinion, the results on a particular test
> >> >> >>> collection or 2 have
> >> >> >>> >> perhaps
> >> >> >>> >> been taken too far and created a myth that BM25 is
> >> >> >>> always superior to
> >> >> >>> >> Lucene's scoring... this is not true!
> >> >> >>> >>
> >> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> >> >> >>> <ip...@yahoo.com>
> >> >> >>> >> wrote:
> >> >> >>> >>
> >> >> >>> >>  I applied the Lucene patch mentioned in
> >> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> >> >> >>> ran the MAP
> >> >> >>> >>> numbers
> >> >> >>> >>> on TREC-3 collection using topics
> >> >> >>> 151-200.  I am not getting worse
> >> >> >>> >>> results
> >> >> >>> >>> comparing to Lucene DefaultSimilarity.  I
> >> >> >>> suspect, I am not using it
> >> >> >>> >>> correctly.  I have single field
> >> >> >>> documents.  This is the process I use:
> >> >> >>> >>>
> >> >> >>> >>> 1. During the indexing, I am setting the
> >> >> >>> similarity to BM25 as such:
> >> >> >>> >>>
> >> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new
> >> >> >>> StandardAnalyzer(
> >> >> >>> >>>
> >> >> >>>    Version.LUCENE_CURRENT), true,
> >> >> >>> >>>
> >> >> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
> >> >> >>> >>> writer.setSimilarity(new BM25Similarity());
> >> >> >>> >>>
> >> >> >>> >>> 2. During the Precision/Recall measurements, I
> >> >> >>> am using a
> >> >> >>> >>> SimpleBM25QQParser extension I added to the
> >> >> >>> benchmark:
> >> >> >>> >>>
> >> >> >>> >>> QualityQueryParser qqParser = new
> >> >> >>> SimpleBM25QQParser("title", "TEXT");
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>> 3. Here is the parser code (I set an avg doc
> >> >> >>> length here):
> >> >> >>> >>>
> >> >> >>> >>> public Query parse(QualityQuery qq) throws
> >> >> >>> ParseException {
> >> >> >>> >>>   BM25Parameters.setAverageLength(indexField,
> >> >> >>> 798.30f);//avg doc length
> >> >> >>> >>>   BM25Parameters.setB(0.5f);//tried
> >> >> >>> default values
> >> >> >>> >>>   BM25Parameters.setK1(2f);
> >> >> >>> >>>   return query = new
> >> >> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
> >> >> >>> >>> new
> >> >> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> >> >> >>> >>> }
> >> >> >>> >>>
> >> >> >>> >>> 4. The searcher is using BM25 similarity:
> >> >> >>> >>>
> >> >> >>> >>> Searcher searcher = new IndexSearcher(dir,
> >> >> >>> true);
> >> >> >>> >>> searcher.setSimilarity(sim);
> >> >> >>> >>>
> >> >> >>> >>> Am I missing some steps?  Does anyone
> >> >> >>> have experience with this code?
> >> >> >>> >>>
> >> >> >>> >>> Thanks,
> >> >> >>> >>>
> >> >> >>> >>> Ivan
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>>
> >> ---------------------------------------------------------------------
> >> >> >>> >>> To unsubscribe, e-mail:
> >> java-user-unsubscribe@lucene.apache.org
> >> >> >>> >>> For additional commands, e-mail:
> >> >> java-user-help@lucene.apache.org
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> > --
> >> >> >>> >
> >> >> >>> -----------------------------------------------------------
> >> >> >>> > Joaquín Pérez Iglesias
> >> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >> >> >>> > E.T.S.I. Informática (UNED)
> >> >> >>> > Ciudad Universitaria
> >> >> >>> > C/ Juan del Rosal nº 16
> >> >> >>> > 28040 Madrid - Spain
> >> >> >>> > Phone. +34 91 398 89 19
> >> >> >>> > Fax    +34 91 398 65 35
> >> >> >>> > Office  2.11
> >> >> >>> > Email: joaquin.perez@lsi.uned.es
> >> >> >>> > web:   http://nlp.uned.es/~jperezi/
> >> >> >>> >
> >> >> >>> -----------------------------------------------------------
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> ---------------------------------------------------------------------
> >> >> >>> > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> >> >> >>> > For additional commands, e-mail:
> >> java-user-help@lucene.apache.org
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> Robert Muir
> >> >> >>> rcmuir@gmail.com
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by JOAQUIN PEREZ IGLESIAS <jo...@lsi.uned.es>.
Ok,

I'm not advocating the BM25 patch either; unfortunately, BM25 was not my
idea :-))), and I'm sure that the implementation can be improved.

When you use the BM25 implementation, are you optimising the parameters
specifically per collection? (It is a key factor for improving BM25
performance).
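(For readers following the thread: the role of k1 and b can be seen in the standard BM25 term-weight formula. The sketch below is purely illustrative and self-contained; it is not code from the LUCENE-2091 patch, and the parameter values are just the examples mentioned earlier in this thread.)

```java
// Illustrative, self-contained BM25 term weight; not taken from the patch.
public class Bm25Sketch {
    // idf: inverse document frequency of the term
    // tf: term frequency in the document
    // dl: document length in tokens; avgdl: average document length
    static double bm25Weight(double idf, double tf, double dl,
                             double avgdl, double k1, double b) {
        double norm = 1 - b + b * (dl / avgdl);          // length normalization, controlled by b
        return idf * (tf * (k1 + 1)) / (tf + k1 * norm); // tf saturation, controlled by k1
    }

    public static void main(String[] args) {
        // Example values from the thread: k1 = 2, b = 0.5, avgdl = 798.30
        double shortDoc = bm25Weight(1.0, 3, 100, 798.30, 2.0, 0.5);
        double longDoc  = bm25Weight(1.0, 3, 3000, 798.30, 2.0, 0.5);
        // With b > 0, the same tf in a much longer document scores lower,
        // which is why avgdl must make sense for the collection.
        System.out.println(shortDoc > longDoc);
    }
}
```

This also shows why b and k1 need tuning per collection: b decides how hard length normalization bites relative to avgdl, and k1 decides how quickly repeated term occurrences stop adding score.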

Why do you think that BM25 works better for English than for other
languages (apart from the experiments)? What are your intuitions?

I don't have much experience with languages other than Spanish and
English, and it sounds pretty interesting.

Kind Regards.

P.S: Maybe this is not a topic for this list???


> Joaquin, I don't see this as a flame war? First of all I'd like to
> personally thank you for your excellent BM25 implementation!
>
> I think the selection of a retrieval model depends highly on the
> language/indexing approach, i.e. if we were talking East Asian languages I
> think we want a probabilistic model: no argument there!
>
> All i said was that it is a myth that BM25 is "always" better than
> Lucene's
> scoring model, it really depends on what you are trying to do, how you are
> indexing your text, properties of your corpus, how your queries are
> running.
>
> I don't even want to come across as advocating the lnb.ltc approach
> either,
> sure I wrote the patch, but this means nothing. I only like it as its
> currently a simple integration into Lucene, but long-term its best if we
> can
> support other models also!
>
> Finally I think there is something to be said for Lucene's default
> retrieval
> model, which in my (non-english) findings across the board isn't terrible
> at
> all... then again I am working with languages where analysis is really the
> thing holding Lucene back, not scoring.
>
> On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
> joaquin.perez@lsi.uned.es> wrote:
>
>> Just some final comments (as I said I'm not interested in flame wars),
>>
>> If I obtain better results there are not problem with pooling otherwise
>> it
>> is biased.
>> The only important thing (in my opinion) is that it cannot be said that
>> BM25 is a myth.
>> Yes, you are right there is not an only ranking model that beats the
>> rest,
>> but there are models that generally show a better performance in more
>> cases.
>>
>> About CLEF I have had the same experience (VSM vs BM25) on Spanish and
>> English (WebCLEF) and Q&A (ResPubliQA)
>>
>> Ivan checks the parameters (b and k1), probably you can improve your
>> results. (that's the bad part of BM25).
>>
>> Finally we are just speaking of personal experience, so obviously you
>> should use the best model for your data and your own experience, on IR
>> there are not myths neither best ranking models. If any of us is able to
>> find the "best" ranking model, or is able to prove that any
>> state-of-the art is a myth he should send these results to the SIGIR
>> conference.
>>
>> Ivan, Robert good luck with your experiments, as I said the good part of
>> IR is that you can always make experiments on your own.
>>
>> > I don't think its really a competition, I think preferably we should
>> have
>> > the flexibility to change the scoring model in lucene actually?
>> >
>> > I have found lots of cases where VSM improves on BM25, but then again
>> I
>> > don't work with TREC stuff, as I work with non-english collections.
>> >
>> > It doesn't contradict years of research to say that VSM isn't a
>> > state-of-the-art model, besides the TREC-4 results, there are CLEF
>> results
>> > where VSM models perform competitively or exceed (Finnish, Russian,
>> etc)
>> > BM25/DFR/etc.
>> >
>> > It depends on the collection, there isn't a 'best retrieval formula'.
>> >
>> > Note: I have no bias against BM-25, but its definitely a myth to say
>> there
>> > is a single retrieval formula that is the 'best' across the board.
>> >
>> >
>> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
>> > joaquin.perez@lsi.uned.es> wrote:
>> >
>> >> By the way,
>> >>
>> >> I don't want to start a flame war VSM vs BM25, but I really believe
>> that
>> >> I
>> >> have to express my opinion as Robert has done. In my experience, I
>> have
>> >> never found a case where VSM improves significantly BM25. Maybe you
>> can
>> >> find some cases under some very specific collection characteristics,
>> (as
>> >> average length of 300 vs 3000) or a bad usage of BM25 (not proper
>> >> parameters) where it can happen.
>> >>
>> >> BM25 is not just only a different way of length normalization, it is
>> >> based
>> >> strongly in the probabilistic framework, and parametrises frequencies
>> >> and
>> >> length. This is probably the most successful ranking model of the
>> last
>> >> years in Information Retrieval.
>> >>
>> >> I have never read a paper where VSM  improves any of the
>> >> state-of-the-art
>> >> ranking models (Language Models, DFR, BM25,...),  although the VSM
>> with
>> >> pivoted normalisation length can obtain nice results. This can be
>> proved
>> >> checking the last years of the TREC competition.
>> >>
>> >> Honestly to say that is a myth that BM25 improves VSM breaks the last
>> 10
>> >> or 15 years of research on Information Retrieval, and I really
>> believe
>> >> that is not accurate.
>> >>
>> >> The good thing of Information Retrieval is that you can always make
>> your
>> >> owns experiments and you can use the experience of a lot of years of
>> >> research.
>> >>
>> >> PS: This opinion is based on experiments on TREC and CLEF
>> collections,
>> >> obviously we can start a debate about the suitability of this type of
>> >> experimentation (concept of relevance, pooling, relevance
>> judgements),
>> >> but
>> >> this is a much more complex topic and I believe is far from what we
>> are
>> >> dealing here.
>> >>
>> >> PS2: In relation with TREC4 Cornell used a pivoted length
>> normalisation
>> >> and they were applying pseudo-relevance feedback, what honestly makes
>> >> much
>> >> more difficult the analysis of the results. Obviously their results
>> were
>> >> part of the pool.
>> >>
>> >> Sorry for the huge mail :-))))
>> >>
>> >> > Hi Ivan,
>> >> >
>> >> > the problem is that unfortunately BM25
>> >> > cannot be implemented overwriting
>> >> > the Similarity interface. Therefore BM25Similarity
>> >> > only computes the classic probabilistic IDF (what is
>> >> > interesting only at search time).
>> >> > If you set BM25Similarity at indexing time
>> >> > some basic stats are not stored
>> >> > correctly in the segments (like docs length).
>> >> >
>> >> > When you use BM25BooleanQuery this class
>> >> > will set automatically the BM25Similarity for you,
>> >> > therefore you don't need to do this explicitly.
>> >> >
>> >> > I tried to make this implementation with the focus on
>> >> > not interfering on the typical use of Lucene (so no changing
>> >> > DefaultSimilarity).
>> >> >
>> >> >> Joaquin, Robert,
>> >> >>
>> >> >> I followed Joaquin's recommendation and removed the call to set
>> >> >> similarity
>> >> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
>> >> >> improvement for the MAP score (0.141->0.219) over default
>> similarity.
>> >> >>
>> >> >> Joaquin, how would setting the similarity to BM25 explicitly make
>> the
>> >> >> score worse?
>> >> >>
>> >> >> Thank you,
>> >> >>
>> >> >> Ivan
>> >> >>
>> >> >>
>> >> >>
>> >> >> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>> >> >>
>> >> >>> From: Robert Muir <rc...@gmail.com>
>> >> >>> Subject: Re: BM25 Scoring Patch
>> >> >>> To: java-user@lucene.apache.org
>> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
>> >> >>> yes Ivan, if possible please report
>> >> >>> back any findings you can on the
>> >> >>> experiments you are doing!
>> >> >>>
>> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>> >> >>> <
>> >> >>> joaquin.perez@lsi.uned.es>
>> >> >>> wrote:
>> >> >>>
>> >> >>> > Hi Ivan,
>> >> >>> >
>> >> >>> > You shouldn't set the BM25Similarity for indexing or
>> >> >>> searching.
>> >> >>> > Please try removing the lines:
>> >> >>> >   writer.setSimilarity(new
>> >> >>> BM25Similarity());
>> >> >>> >   searcher.setSimilarity(sim);
>> >> >>> >
>> >> >>> > Please let us/me know if you improve your results with
>> >> >>> these changes.
>> >> >>> >
>> >> >>> >
>> >> >>> > Robert Muir escribió:
>> >> >>> >
>> >> >>> >  Hi Ivan, I've seen many cases where BM25
>> >> >>> performs worse than Lucene's
>> >> >>> >> default Similarity. Perhaps this is just another
>> >> >>> one?
>> >> >>> >>
>> >> >>> >> Again while I have not worked with this particular
>> >> >>> collection, I looked at
>> >> >>> >> the statistics and noted that its composed of
>> >> >>> several 'sub-collections':
>> >> >>> >> for
>> >> >>> >> example the PAT documents on disk 3 have an
>> >> >>> average doc length of 3543,
>> >> >>> >> but
>> >> >>> >> the AP documents on disk 1 have an avg doc length
>> >> >>> of 353.
>> >> >>> >>
>> >> >>> >> I have found on other collections that any
>> >> >>> advantages of BM25's document
>> >> >>> >> length normalization fall apart when 'average
>> >> >>> document length' doesn't
>> >> >>> >> make
>> >> >>> >> a whole lot of sense (cases like this).
>> >> >>> >>
>> >> >>> >> For this same reason, I've only found a few
>> >> >>> collections where BM25's doc
>> >> >>> >> length normalization is really significantly
>> >> >>> better than Lucene's.
>> >> >>> >>
>> >> >>> >> In my opinion, the results on a particular test
>> >> >>> collection or 2 have
>> >> >>> >> perhaps
>> >> >>> >> been taken too far and created a myth that BM25 is
>> >> >>> always superior to
>> >> >>> >> Lucene's scoring... this is not true!
>> >> >>> >>
>> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>> >> >>> <ip...@yahoo.com>
>> >> >>> >> wrote:
>> >> >>> >>
>> >> >>> >>  I applied the Lucene patch mentioned in
>> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
>> >> >>> ran the MAP
>> >> >>> >>> numbers
>> >> >>> >>> on TREC-3 collection using topics
>> >> >>> 151-200.  I am not getting worse
>> >> >>> >>> results
>> >> >>> >>> comparing to Lucene DefaultSimilarity.  I
>> >> >>> suspect, I am not using it
>> >> >>> >>> correctly.  I have single field
>> >> >>> documents.  This is the process I use:
>> >> >>> >>>
>> >> >>> >>> 1. During the indexing, I am setting the
>> >> >>> similarity to BM25 as such:
>> >> >>> >>>
>> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new
>> >> >>> StandardAnalyzer(
>> >> >>> >>>
>> >> >>>    Version.LUCENE_CURRENT), true,
>> >> >>> >>>
>> >> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
>> >> >>> >>> writer.setSimilarity(new BM25Similarity());
>> >> >>> >>>
>> >> >>> >>> 2. During the Precision/Recall measurements, I
>> >> >>> am using a
>> >> >>> >>> SimpleBM25QQParser extension I added to the
>> >> >>> benchmark:
>> >> >>> >>>
>> >> >>> >>> QualityQueryParser qqParser = new
>> >> >>> SimpleBM25QQParser("title", "TEXT");
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> 3. Here is the parser code (I set an avg doc
>> >> >>> length here):
>> >> >>> >>>
>> >> >>> >>> public Query parse(QualityQuery qq) throws
>> >> >>> ParseException {
>> >> >>> >>>   BM25Parameters.setAverageLength(indexField,
>> >> >>> 798.30f);//avg doc length
>> >> >>> >>>   BM25Parameters.setB(0.5f);//tried
>> >> >>> default values
>> >> >>> >>>   BM25Parameters.setK1(2f);
>> >> >>> >>>   return query = new
>> >> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >> >>> >>> new
>> >> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
>> >> >>> >>> }
>> >> >>> >>>
>> >> >>> >>> 4. The searcher is using BM25 similarity:
>> >> >>> >>>
>> >> >>> >>> Searcher searcher = new IndexSearcher(dir,
>> >> >>> true);
>> >> >>> >>> searcher.setSimilarity(sim);
>> >> >>> >>>
>> >> >>> >>> Am I missing some steps?  Does anyone
>> >> >>> have experience with this code?
>> >> >>> >>>
>> >> >>> >>> Thanks,
>> >> >>> >>>
>> >> >>> >>> Ivan
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>>
>> ---------------------------------------------------------------------
>> >> >>> >>> To unsubscribe, e-mail:
>> java-user-unsubscribe@lucene.apache.org
>> >> >>> >>> For additional commands, e-mail:
>> >> java-user-help@lucene.apache.org
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>
>> >> >>> >>
>> >> >>> > --
>> >> >>> >
>> >> >>> -----------------------------------------------------------
>> >> >>> > Joaquín Pérez Iglesias
>> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
>> >> >>> > E.T.S.I. Informática (UNED)
>> >> >>> > Ciudad Universitaria
>> >> >>> > C/ Juan del Rosal nº 16
>> >> >>> > 28040 Madrid - Spain
>> >> >>> > Phone. +34 91 398 89 19
>> >> >>> > Fax    +34 91 398 65 35
>> >> >>> > Office  2.11
>> >> >>> > Email: joaquin.perez@lsi.uned.es
>> >> >>> > web:   http://nlp.uned.es/~jperezi/
>> >> >>> >
>> >> >>> -----------------------------------------------------------
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>>
>> ---------------------------------------------------------------------
>> >> >>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >>> > For additional commands, e-mail:
>> java-user-help@lucene.apache.org
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Robert Muir
>> >> >>> rcmuir@gmail.com
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > Robert Muir
>> > rcmuir@gmail.com
>> >
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Joaquin, I don't see this as a flame war? First of all I'd like to
personally thank you for your excellent BM25 implementation!

I think the selection of a retrieval model depends highly on the
language/indexing approach, i.e. if we were talking East Asian languages I
think we want a probabilistic model: no argument there!

All I said was that it is a myth that BM25 is "always" better than Lucene's
scoring model, it really depends on what you are trying to do, how you are
indexing your text, properties of your corpus, how your queries are running.

I don't even want to come across as advocating the lnb.ltc approach either,
sure I wrote the patch, but this means nothing. I only like it as it's
currently a simple integration into Lucene, but long-term it's best if we can
support other models also!

Finally I think there is something to be said for Lucene's default retrieval
model, which in my (non-english) findings across the board isn't terrible at
all... then again I am working with languages where analysis is really the
thing holding Lucene back, not scoring.

On Tue, Feb 16, 2010 at 2:40 PM, JOAQUIN PEREZ IGLESIAS <
joaquin.perez@lsi.uned.es> wrote:

> Just some final comments (as I said I'm not interested in flame wars),
>
> If I obtain better results there are not problem with pooling otherwise it
> is biased.
> The only important thing (in my opinion) is that it cannot be said that
> BM25 is a myth.
> Yes, you are right there is not an only ranking model that beats the rest,
> but there are models that generally show a better performance in more
> cases.
>
> About CLEF I have had the same experience (VSM vs BM25) on Spanish and
> English (WebCLEF) and Q&A (ResPubliQA)
>
> Ivan checks the parameters (b and k1), probably you can improve your
> results. (that's the bad part of BM25).
>
> Finally we are just speaking of personal experience, so obviously you
> should use the best model for your data and your own experience, on IR
> there are not myths neither best ranking models. If any of us is able to
> find the "best" ranking model, or is able to prove that any
> state-of-the art is a myth he should send these results to the SIGIR
> conference.
>
> Ivan, Robert good luck with your experiments, as I said the good part of
> IR is that you can always make experiments on your own.
>
> > I don't think its really a competition, I think preferably we should have
> > the flexibility to change the scoring model in lucene actually?
> >
> > I have found lots of cases where VSM improves on BM25, but then again I
> > don't work with TREC stuff, as I work with non-english collections.
> >
> > It doesn't contradict years of research to say that VSM isn't a
> > state-of-the-art model, besides the TREC-4 results, there are CLEF
> results
> > where VSM models perform competitively or exceed (Finnish, Russian, etc)
> > BM25/DFR/etc.
> >
> > It depends on the collection, there isn't a 'best retrieval formula'.
> >
> > Note: I have no bias against BM-25, but its definitely a myth to say
> there
> > is a single retrieval formula that is the 'best' across the board.
> >
> >
> > On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> > joaquin.perez@lsi.uned.es> wrote:
> >
> >> By the way,
> >>
> >> I don't want to start a flame war VSM vs BM25, but I really believe that
> >> I
> >> have to express my opinion as Robert has done. In my experience, I have
> >> never found a case where VSM improves significantly BM25. Maybe you can
> >> find some cases under some very specific collection characteristics, (as
> >> average length of 300 vs 3000) or a bad usage of BM25 (not proper
> >> parameters) where it can happen.
> >>
> >> BM25 is not just only a different way of length normalization, it is
> >> based
> >> strongly in the probabilistic framework, and parametrises frequencies
> >> and
> >> length. This is probably the most successful ranking model of the last
> >> years in Information Retrieval.
> >>
> >> I have never read a paper where VSM  improves any of the
> >> state-of-the-art
> >> ranking models (Language Models, DFR, BM25,...),  although the VSM with
> >> pivoted normalisation length can obtain nice results. This can be proved
> >> checking the last years of the TREC competition.
> >>
> >> Honestly to say that is a myth that BM25 improves VSM breaks the last 10
> >> or 15 years of research on Information Retrieval, and I really believe
> >> that is not accurate.
> >>
> >> The good thing of Information Retrieval is that you can always make your
> >> owns experiments and you can use the experience of a lot of years of
> >> research.
> >>
> >> PS: This opinion is based on experiments on TREC and CLEF collections,
> >> obviously we can start a debate about the suitability of this type of
> >> experimentation (concept of relevance, pooling, relevance judgements),
> >> but
> >> this is a much more complex topic and I believe is far from what we are
> >> dealing here.
> >>
> >> PS2: In relation with TREC4 Cornell used a pivoted length normalisation
> >> and they were applying pseudo-relevance feedback, what honestly makes
> >> much
> >> more difficult the analysis of the results. Obviously their results were
> >> part of the pool.
> >>
> >> Sorry for the huge mail :-))))
> >>
> >> > Hi Ivan,
> >> >
> >> > the problem is that unfortunately BM25
> >> > cannot be implemented overwriting
> >> > the Similarity interface. Therefore BM25Similarity
> >> > only computes the classic probabilistic IDF (what is
> >> > interesting only at search time).
> >> > If you set BM25Similarity at indexing time
> >> > some basic stats are not stored
> >> > correctly in the segments (like docs length).
> >> >
> >> > When you use BM25BooleanQuery this class
> >> > will set automatically the BM25Similarity for you,
> >> > therefore you don't need to do this explicitly.
> >> >
> >> > I tried to make this implementation with the focus on
> >> > not interfering on the typical use of Lucene (so no changing
> >> > DefaultSimilarity).
> >> >
> >> >> Joaquin, Robert,
> >> >>
> >> >> I followed Joaquin's recommendation and removed the call to set
> >> >> similarity
> >> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
> >> >> improvement for the MAP score (0.141->0.219) over default similarity.
> >> >>
> >> >> Joaquin, how would setting the similarity to BM25 explicitly make the
> >> >> score worse?
> >> >>
> >> >> Thank you,
> >> >>
> >> >> Ivan
> >> >>
> >> >>
> >> >>
> >> >> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
> >> >>
> >> >>> From: Robert Muir <rc...@gmail.com>
> >> >>> Subject: Re: BM25 Scoring Patch
> >> >>> To: java-user@lucene.apache.org
> >> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >> >>> yes Ivan, if possible please report
> >> >>> back any findings you can on the
> >> >>> experiments you are doing!
> >> >>>
> >> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> >> >>> <
> >> >>> joaquin.perez@lsi.uned.es>
> >> >>> wrote:
> >> >>>
> >> >>> > Hi Ivan,
> >> >>> >
> >> >>> > You shouldn't set the BM25Similarity for indexing or
> >> >>> searching.
> >> >>> > Please try removing the lines:
> >> >>> >   writer.setSimilarity(new
> >> >>> BM25Similarity());
> >> >>> >   searcher.setSimilarity(sim);
> >> >>> >
> >> >>> > Please let us/me know if you improve your results with
> >> >>> these changes.
> >> >>> >
> >> >>> >
> >> >>> > Robert Muir escribió:
> >> >>> >
> >> >>> >  Hi Ivan, I've seen many cases where BM25
> >> >>> performs worse than Lucene's
> >> >>> >> default Similarity. Perhaps this is just another
> >> >>> one?
> >> >>> >>
> >> >>> >> Again while I have not worked with this particular
> >> >>> collection, I looked at
> >> >>> >> the statistics and noted that its composed of
> >> >>> several 'sub-collections':
> >> >>> >> for
> >> >>> >> example the PAT documents on disk 3 have an
> >> >>> average doc length of 3543,
> >> >>> >> but
> >> >>> >> the AP documents on disk 1 have an avg doc length
> >> >>> of 353.
> >> >>> >>
> >> >>> >> I have found on other collections that any
> >> >>> advantages of BM25's document
> >> >>> >> length normalization fall apart when 'average
> >> >>> document length' doesn't
> >> >>> >> make
> >> >>> >> a whole lot of sense (cases like this).
> >> >>> >>
> >> >>> >> For this same reason, I've only found a few
> >> >>> collections where BM25's doc
> >> >>> >> length normalization is really significantly
> >> >>> better than Lucene's.
> >> >>> >>
> >> >>> >> In my opinion, the results on a particular test
> >> >>> collection or 2 have
> >> >>> >> perhaps
> >> >>> >> been taken too far and created a myth that BM25 is
> >> >>> always superior to
> >> >>> >> Lucene's scoring... this is not true!
> >> >>> >>
> >> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> >> >>> <ip...@yahoo.com>
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >>  I applied the Lucene patch mentioned in
> >> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> >> >>> ran the MAP
> >> >>> >>> numbers
> >> >>> >>> on TREC-3 collection using topics
> >> >>> 151-200.  I am not getting worse
> >> >>> >>> results
> >> >>> >>> comparing to Lucene DefaultSimilarity.  I
> >> >>> suspect, I am not using it
> >> >>> >>> correctly.  I have single field
> >> >>> documents.  This is the process I use:
> >> >>> >>>
> >> >>> >>> 1. During the indexing, I am setting the
> >> >>> similarity to BM25 as such:
> >> >>> >>>
> >> >>> >>> IndexWriter writer = new IndexWriter(dir, new
> >> >>> StandardAnalyzer(
> >> >>> >>>
> >> >>>    Version.LUCENE_CURRENT), true,
> >> >>> >>>
> >> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
> >> >>> >>> writer.setSimilarity(new BM25Similarity());
> >> >>> >>>
> >> >>> >>> 2. During the Precision/Recall measurements, I
> >> >>> am using a
> >> >>> >>> SimpleBM25QQParser extension I added to the
> >> >>> benchmark:
> >> >>> >>>
> >> >>> >>> QualityQueryParser qqParser = new
> >> >>> SimpleBM25QQParser("title", "TEXT");
> >> >>> >>>
> >> >>> >>>
> >> >>> >>> 3. Here is the parser code (I set an avg doc
> >> >>> length here):
> >> >>> >>>
> >> >>> >>> public Query parse(QualityQuery qq) throws
> >> >>> ParseException {
> >> >>> >>>   BM25Parameters.setAverageLength(indexField,
> >> >>> 798.30f);//avg doc length
> >> >>> >>>   BM25Parameters.setB(0.5f);//tried
> >> >>> default values
> >> >>> >>>   BM25Parameters.setK1(2f);
> >> >>> >>>   return query = new
> >> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
> >> >>> >>> new
> >> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> >> >>> >>> }
> >> >>> >>>
> >> >>> >>> 4. The searcher is using BM25 similarity:
> >> >>> >>>
> >> >>> >>> Searcher searcher = new IndexSearcher(dir,
> >> >>> true);
> >> >>> >>> searcher.setSimilarity(sim);
> >> >>> >>>
> >> >>> >>> Am I missing some steps?  Does anyone
> >> >>> have experience with this code?
> >> >>> >>>
> >> >>> >>> Thanks,
> >> >>> >>>
> >> >>> >>> Ivan
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>>
> ---------------------------------------------------------------------
> >> >>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >>> >>> For additional commands, e-mail:
> >> java-user-help@lucene.apache.org
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>
> >> >>> >>
> >> >>> > --
> >> >>> >
> >> >>> -----------------------------------------------------------
> >> >>> > Joaquín Pérez Iglesias
> >> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >> >>> > E.T.S.I. Informática (UNED)
> >> >>> > Ciudad Universitaria
> >> >>> > C/ Juan del Rosal nº 16
> >> >>> > 28040 Madrid - Spain
> >> >>> > Phone. +34 91 398 89 19
> >> >>> > Fax    +34 91 398 65 35
> >> >>> > Office  2.11
> >> >>> > Email: joaquin.perez@lsi.uned.es
> >> >>> > web:   http://nlp.uned.es/~jperezi/
> >> >>> >
> >> >>> -----------------------------------------------------------
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>>
> >> >>> >
> >> >>> >
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Robert Muir
> >> >>> rcmuir@gmail.com
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >>
> >>
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by JOAQUIN PEREZ IGLESIAS <jo...@lsi.uned.es>.
Just some final comments (as I said I'm not interested in flame wars),

If I obtain better results, there is no problem with pooling; otherwise it
is biased.
The only important thing (in my opinion) is that it cannot be said that
BM25 is a myth.
Yes, you are right that there is no single ranking model that beats the
rest, but there are models that generally show better performance in more
cases.

About CLEF, I have had the same experience (VSM vs. BM25) on Spanish and
English (WebCLEF) and on Q&A (ResPubliQA).

Ivan, check the parameters (b and k1); you can probably improve your
results (that's the bad part of BM25).

Finally, we are just speaking from personal experience, so obviously you
should use the best model for your data and your own experience; in IR
there are neither myths nor best ranking models. If any of us were able to
find the "best" ranking model, or to prove that any state-of-the-art model
is a myth, he should send those results to the SIGIR conference.

Ivan, Robert, good luck with your experiments; as I said, the good part of
IR is that you can always make experiments on your own.
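Since tuning b and k1 comes up here, a minimal self-contained sketch of the BM25 per-term weight may help show where the two parameters enter. The idf value and frequencies below are purely illustrative, and this is a sketch of the textbook formula, not the patch's actual code:

```java
// Illustrative BM25 per-term weight: idf * tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl)).
// k1 controls term-frequency saturation; b controls document-length normalization.
public class Bm25Demo {
    // Score contribution of one term with frequency tf in a document of length dl.
    static double score(double tf, double dl, double avgdl,
                        double k1, double b, double idf) {
        double norm = 1.0 - b + b * (dl / avgdl);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm);
    }

    public static void main(String[] args) {
        double avgdl = 798.30, idf = 2.0; // avg length from this thread; idf arbitrary
        // With b = 0.5, a document longer than average is penalized
        // relative to a shorter one at the same term frequency:
        System.out.println(score(5, 3000, avgdl, 2.0, 0.5, idf)); // long doc
        System.out.println(score(5, 353, avgdl, 2.0, 0.5, idf));  // short doc
        // With b = 0, length normalization is switched off entirely:
        System.out.println(score(5, 3000, avgdl, 2.0, 0.0, idf)
                == score(5, 353, avgdl, 2.0, 0.0, idf)); // prints true
    }
}
```

Raising b increases the penalty on documents longer than the average; raising k1 makes the score more sensitive to raw term frequency before it saturates.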

> I don't think its really a competition, I think preferably we should have
> the flexibility to change the scoring model in lucene actually?
>
> I have found lots of cases where VSM improves on BM25, but then again I
> don't work with TREC stuff, as I work with non-english collections.
>
> It doesn't contradict years of research to say that VSM isn't a
> state-of-the-art model, besides the TREC-4 results, there are CLEF results
> where VSM models perform competitively or exceed (Finnish, Russian, etc)
> BM25/DFR/etc.
>
> It depends on the collection, there isn't a 'best retrieval formula'.
>
> Note: I have no bias against BM-25, but its definitely a myth to say there
> is a single retrieval formula that is the 'best' across the board.
>
>
> On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
> joaquin.perez@lsi.uned.es> wrote:
>
>> By the way,
>>
>> I don't want to start a flame war VSM vs BM25, but I really believe that
>> I
>> have to express my opinion as Robert has done. In my experience, I have
>> never found a case where VSM improves significantly BM25. Maybe you can
>> find some cases under some very specific collection characteristics, (as
>> average length of 300 vs 3000) or a bad usage of BM25 (not proper
>> parameters) where it can happen.
>>
>> BM25 is not just only a different way of length normalization, it is
>> based
>> strongly in the probabilistic framework, and parametrises frequencies
>> and
>> length. This is probably the most successful ranking model of the last
>> years in Information Retrieval.
>>
>> I have never read a paper where VSM  improves any of the
>> state-of-the-art
>> ranking models (Language Models, DFR, BM25,...),  although the VSM with
>> pivoted normalisation length can obtain nice results. This can be proved
>> checking the last years of the TREC competition.
>>
>> Honestly to say that is a myth that BM25 improves VSM breaks the last 10
>> or 15 years of research on Information Retrieval, and I really believe
>> that is not accurate.
>>
>> The good thing of Information Retrieval is that you can always make your
>> owns experiments and you can use the experience of a lot of years of
>> research.
>>
>> PS: This opinion is based on experiments on TREC and CLEF collections,
>> obviously we can start a debate about the suitability of this type of
>> experimentation (concept of relevance, pooling, relevance judgements),
>> but
>> this is a much more complex topic and I believe is far from what we are
>> dealing here.
>>
>> PS2: In relation with TREC4 Cornell used a pivoted length normalisation
>> and they were applying pseudo-relevance feedback, what honestly makes
>> much
>> more difficult the analysis of the results. Obviously their results were
>> part of the pool.
>>
>> Sorry for the huge mail :-))))
>>
>> > Hi Ivan,
>> >
>> > the problem is that unfortunately BM25
>> > cannot be implemented overwriting
>> > the Similarity interface. Therefore BM25Similarity
>> > only computes the classic probabilistic IDF (what is
>> > interesting only at search time).
>> > If you set BM25Similarity at indexing time
>> > some basic stats are not stored
>> > correctly in the segments (like docs length).
>> >
>> > When you use BM25BooleanQuery this class
>> > will set automatically the BM25Similarity for you,
>> > therefore you don't need to do this explicitly.
>> >
>> > I tried to make this implementation with the focus on
>> > not interfering on the typical use of Lucene (so no changing
>> > DefaultSimilarity).
>> >
>> >> Joaquin, Robert,
>> >>
>> >> I followed Joaquin's recommendation and removed the call to set
>> >> similarity
>> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
>> >> improvement for the MAP score (0.141->0.219) over default similarity.
>> >>
>> >> Joaquin, how would setting the similarity to BM25 explicitly make the
>> >> score worse?
>> >>
>> >> Thank you,
>> >>
>> >> Ivan
>> >>
>> >>
>> >>
>> >> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>> >>
>> >>> From: Robert Muir <rc...@gmail.com>
>> >>> Subject: Re: BM25 Scoring Patch
>> >>> To: java-user@lucene.apache.org
>> >>> Date: Tuesday, February 16, 2010, 11:36 AM
>> >>> yes Ivan, if possible please report
>> >>> back any findings you can on the
>> >>> experiments you are doing!
>> >>>
>> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>> >>> <
>> >>> joaquin.perez@lsi.uned.es>
>> >>> wrote:
>> >>>
>> >>> > Hi Ivan,
>> >>> >
>> >>> > You shouldn't set the BM25Similarity for indexing or
>> >>> searching.
>> >>> > Please try removing the lines:
>> >>> >   writer.setSimilarity(new
>> >>> BM25Similarity());
>> >>> >   searcher.setSimilarity(sim);
>> >>> >
>> >>> > Please let us/me know if you improve your results with
>> >>> these changes.
>> >>> >
>> >>> >
> >> >>> > Robert Muir wrote:
>> >>> >
>> >>> >  Hi Ivan, I've seen many cases where BM25
>> >>> performs worse than Lucene's
>> >>> >> default Similarity. Perhaps this is just another
>> >>> one?
>> >>> >>
>> >>> >> Again while I have not worked with this particular
>> >>> collection, I looked at
>> >>> >> the statistics and noted that its composed of
>> >>> several 'sub-collections':
>> >>> >> for
>> >>> >> example the PAT documents on disk 3 have an
>> >>> average doc length of 3543,
>> >>> >> but
>> >>> >> the AP documents on disk 1 have an avg doc length
>> >>> of 353.
>> >>> >>
>> >>> >> I have found on other collections that any
>> >>> advantages of BM25's document
>> >>> >> length normalization fall apart when 'average
>> >>> document length' doesn't
>> >>> >> make
>> >>> >> a whole lot of sense (cases like this).
>> >>> >>
>> >>> >> For this same reason, I've only found a few
>> >>> collections where BM25's doc
>> >>> >> length normalization is really significantly
>> >>> better than Lucene's.
>> >>> >>
>> >>> >> In my opinion, the results on a particular test
>> >>> collection or 2 have
>> >>> >> perhaps
>> >>> >> been taken too far and created a myth that BM25 is
>> >>> always superior to
>> >>> >> Lucene's scoring... this is not true!
>> >>> >>
>> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>> >>> <ip...@yahoo.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >>  I applied the Lucene patch mentioned in
>> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
>> >>> ran the MAP
>> >>> >>> numbers
>> >>> >>> on TREC-3 collection using topics
>> >>> 151-200.  I am now getting worse
>> >>> >>> results
>> >>> >>> comparing to Lucene DefaultSimilarity.  I
>> >>> suspect, I am not using it
>> >>> >>> correctly.  I have single field
>> >>> documents.  This is the process I use:
>> >>> >>>
>> >>> >>> 1. During the indexing, I am setting the
>> >>> similarity to BM25 as such:
>> >>> >>>
>> >>> >>> IndexWriter writer = new IndexWriter(dir, new
>> >>> StandardAnalyzer(
>> >>> >>>
>> >>>    Version.LUCENE_CURRENT), true,
>> >>> >>>
>> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
>> >>> >>> writer.setSimilarity(new BM25Similarity());
>> >>> >>>
>> >>> >>> 2. During the Precision/Recall measurements, I
>> >>> am using a
>> >>> >>> SimpleBM25QQParser extension I added to the
>> >>> benchmark:
>> >>> >>>
>> >>> >>> QualityQueryParser qqParser = new
>> >>> SimpleBM25QQParser("title", "TEXT");
>> >>> >>>
>> >>> >>>
>> >>> >>> 3. Here is the parser code (I set an avg doc
>> >>> length here):
>> >>> >>>
>> >>> >>> public Query parse(QualityQuery qq) throws
>> >>> ParseException {
>> >>> >>>   BM25Parameters.setAverageLength(indexField,
>> >>> 798.30f);//avg doc length
>> >>> >>>   BM25Parameters.setB(0.5f);//tried
>> >>> default values
>> >>> >>>   BM25Parameters.setK1(2f);
>> >>> >>>   return query = new
>> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >>> >>> new
>> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
>> >>> >>> }
>> >>> >>>
>> >>> >>> 4. The searcher is using BM25 similarity:
>> >>> >>>
>> >>> >>> Searcher searcher = new IndexSearcher(dir,
>> >>> true);
>> >>> >>> searcher.setSimilarity(sim);
>> >>> >>>
>> >>> >>> Am I missing some steps?  Does anyone
>> >>> have experience with this code?
>> >>> >>>
>> >>> >>> Thanks,
>> >>> >>>
>> >>> >>> Ivan
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>> > --
>> >>> >
>> >>> -----------------------------------------------------------
>> >>> > Joaquín Pérez Iglesias
>> >>> > Dpto. Lenguajes y Sistemas Informáticos
>> >>> > E.T.S.I. Informática (UNED)
>> >>> > Ciudad Universitaria
>> >>> > C/ Juan del Rosal nº 16
>> >>> > 28040 Madrid - Spain
>> >>> > Phone. +34 91 398 89 19
>> >>> > Fax    +34 91 398 65 35
>> >>> > Office  2.11
>> >>> > Email: joaquin.perez@lsi.uned.es
>> >>> > web:   http://nlp.uned.es/~jperezi/
>> >>> >
>> >>> -----------------------------------------------------------
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Robert Muir
>> >>> rcmuir@gmail.com
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>





Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
I don't think it's really a competition; I think we should preferably have
the flexibility to change the scoring model in Lucene, actually.

I have found lots of cases where VSM improves on BM25, but then again I
don't work with TREC stuff, as I work with non-English collections.

It doesn't contradict years of research to say that VSM is a
state-of-the-art model: besides the TREC-4 results, there are CLEF results
where VSM models perform competitively with or exceed (Finnish, Russian,
etc.) BM25/DFR/etc.

It depends on the collection; there isn't a 'best retrieval formula'.

Note: I have no bias against BM25, but it's definitely a myth to say there
is a single retrieval formula that is the 'best' across the board.


On Tue, Feb 16, 2010 at 1:53 PM, JOAQUIN PEREZ IGLESIAS <
joaquin.perez@lsi.uned.es> wrote:

> By the way,
>
> I don't want to start a flame war VSM vs BM25, but I really believe that I
> have to express my opinion as Robert has done. In my experience, I have
> never found a case where VSM improves significantly BM25. Maybe you can
> find some cases under some very specific collection characteristics, (as
> average length of 300 vs 3000) or a bad usage of BM25 (not proper
> parameters) where it can happen.
>
> BM25 is not just only a different way of length normalization, it is based
> strongly in the probabilistic framework, and parametrises frequencies and
> length. This is probably the most successful ranking model of the last
> years in Information Retrieval.
>
> I have never read a paper where VSM  improves any of the state-of-the-art
> ranking models (Language Models, DFR, BM25,...),  although the VSM with
> pivoted normalisation length can obtain nice results. This can be proved
> checking the last years of the TREC competition.
>
> Honestly to say that is a myth that BM25 improves VSM breaks the last 10
> or 15 years of research on Information Retrieval, and I really believe
> that is not accurate.
>
> The good thing of Information Retrieval is that you can always make your
> owns experiments and you can use the experience of a lot of years of
> research.
>
> PS: This opinion is based on experiments on TREC and CLEF collections,
> obviously we can start a debate about the suitability of this type of
> experimentation (concept of relevance, pooling, relevance judgements), but
> this is a much more complex topic and I believe is far from what we are
> dealing here.
>
> PS2: In relation with TREC4 Cornell used a pivoted length normalisation
> and they were applying pseudo-relevance feedback, what honestly makes much
> more difficult the analysis of the results. Obviously their results were
> part of the pool.
>
> Sorry for the huge mail :-))))
>
> > Hi Ivan,
> >
> > the problem is that unfortunately BM25
> > cannot be implemented overwriting
> > the Similarity interface. Therefore BM25Similarity
> > only computes the classic probabilistic IDF (what is
> > interesting only at search time).
> > If you set BM25Similarity at indexing time
> > some basic stats are not stored
> > correctly in the segments (like docs length).
> >
> > When you use BM25BooleanQuery this class
> > will set automatically the BM25Similarity for you,
> > therefore you don't need to do this explicitly.
> >
> > I tried to make this implementation with the focus on
> > not interfering on the typical use of Lucene (so no changing
> > DefaultSimilarity).
> >
> >> Joaquin, Robert,
> >>
> >> I followed Joaquin's recommendation and removed the call to set
> >> similarity
> >> to BM25 explicitly (indexer, searcher).  The results showed 55%
> >> improvement for the MAP score (0.141->0.219) over default similarity.
> >>
> >> Joaquin, how would setting the similarity to BM25 explicitly make the
> >> score worse?
> >>
> >> Thank you,
> >>
> >> Ivan
> >>
> >>
> >>
> >> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
> >>
> >>> From: Robert Muir <rc...@gmail.com>
> >>> Subject: Re: BM25 Scoring Patch
> >>> To: java-user@lucene.apache.org
> >>> Date: Tuesday, February 16, 2010, 11:36 AM
> >>> yes Ivan, if possible please report
> >>> back any findings you can on the
> >>> experiments you are doing!
> >>>
> >>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> >>> <
> >>> joaquin.perez@lsi.uned.es>
> >>> wrote:
> >>>
> >>> > Hi Ivan,
> >>> >
> >>> > You shouldn't set the BM25Similarity for indexing or
> >>> searching.
> >>> > Please try removing the lines:
> >>> >   writer.setSimilarity(new
> >>> BM25Similarity());
> >>> >   searcher.setSimilarity(sim);
> >>> >
> >>> > Please let us/me know if you improve your results with
> >>> these changes.
> >>> >
> >>> >
> >>> > Robert Muir wrote:
> >>> >
> >>> >  Hi Ivan, I've seen many cases where BM25
> >>> performs worse than Lucene's
> >>> >> default Similarity. Perhaps this is just another
> >>> one?
> >>> >>
> >>> >> Again while I have not worked with this particular
> >>> collection, I looked at
> >>> >> the statistics and noted that its composed of
> >>> several 'sub-collections':
> >>> >> for
> >>> >> example the PAT documents on disk 3 have an
> >>> average doc length of 3543,
> >>> >> but
> >>> >> the AP documents on disk 1 have an avg doc length
> >>> of 353.
> >>> >>
> >>> >> I have found on other collections that any
> >>> advantages of BM25's document
> >>> >> length normalization fall apart when 'average
> >>> document length' doesn't
> >>> >> make
> >>> >> a whole lot of sense (cases like this).
> >>> >>
> >>> >> For this same reason, I've only found a few
> >>> collections where BM25's doc
> >>> >> length normalization is really significantly
> >>> better than Lucene's.
> >>> >>
> >>> >> In my opinion, the results on a particular test
> >>> collection or 2 have
> >>> >> perhaps
> >>> >> been taken too far and created a myth that BM25 is
> >>> always superior to
> >>> >> Lucene's scoring... this is not true!
> >>> >>
> >>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> >>> <ip...@yahoo.com>
> >>> >> wrote:
> >>> >>
> >>> >>  I applied the Lucene patch mentioned in
> >>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> >>> ran the MAP
> >>> >>> numbers
> >>> >>> on TREC-3 collection using topics
> >>> 151-200.  I am now getting worse
> >>> >>> results
> >>> >>> comparing to Lucene DefaultSimilarity.  I
> >>> suspect, I am not using it
> >>> >>> correctly.  I have single field
> >>> documents.  This is the process I use:
> >>> >>>
> >>> >>> 1. During the indexing, I am setting the
> >>> similarity to BM25 as such:
> >>> >>>
> >>> >>> IndexWriter writer = new IndexWriter(dir, new
> >>> StandardAnalyzer(
> >>> >>>
> >>>    Version.LUCENE_CURRENT), true,
> >>> >>>
> >>>    IndexWriter.MaxFieldLength.UNLIMITED);
> >>> >>> writer.setSimilarity(new BM25Similarity());
> >>> >>>
> >>> >>> 2. During the Precision/Recall measurements, I
> >>> am using a
> >>> >>> SimpleBM25QQParser extension I added to the
> >>> benchmark:
> >>> >>>
> >>> >>> QualityQueryParser qqParser = new
> >>> SimpleBM25QQParser("title", "TEXT");
> >>> >>>
> >>> >>>
> >>> >>> 3. Here is the parser code (I set an avg doc
> >>> length here):
> >>> >>>
> >>> >>> public Query parse(QualityQuery qq) throws
> >>> ParseException {
> >>> >>>   BM25Parameters.setAverageLength(indexField,
> >>> 798.30f);//avg doc length
> >>> >>>   BM25Parameters.setB(0.5f);//tried
> >>> default values
> >>> >>>   BM25Parameters.setK1(2f);
> >>> >>>   return query = new
> >>> BM25BooleanQuery(qq.getValue(qqName), indexField,
> >>> >>> new
> >>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> >>> >>> }
> >>> >>>
> >>> >>> 4. The searcher is using BM25 similarity:
> >>> >>>
> >>> >>> Searcher searcher = new IndexSearcher(dir,
> >>> true);
> >>> >>> searcher.setSimilarity(sim);
> >>> >>>
> >>> >>> Am I missing some steps?  Does anyone
> >>> have experience with this code?
> >>> >>>
> >>> >>> Thanks,
> >>> >>>
> >>> >>> Ivan
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>
> >>> >>
> >>> > --
> >>> >
> >>> -----------------------------------------------------------
> >>> > Joaquín Pérez Iglesias
> >>> > Dpto. Lenguajes y Sistemas Informáticos
> >>> > E.T.S.I. Informática (UNED)
> >>> > Ciudad Universitaria
> >>> > C/ Juan del Rosal nº 16
> >>> > 28040 Madrid - Spain
> >>> > Phone. +34 91 398 89 19
> >>> > Fax    +34 91 398 65 35
> >>> > Office  2.11
> >>> > Email: joaquin.perez@lsi.uned.es
> >>> > web:   http://nlp.uned.es/~jperezi/
> >>> >
> >>> -----------------------------------------------------------
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>> --
> >>> Robert Muir
> >>> rcmuir@gmail.com
> >>>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by JOAQUIN PEREZ IGLESIAS <jo...@lsi.uned.es>.
By the way,

I don't want to start a VSM vs. BM25 flame war, but I really believe that
I have to express my opinion, as Robert has done. In my experience, I have
never found a case where VSM significantly improves on BM25. Maybe you can
find some cases with very specific collection characteristics (such as an
average length of 300 vs. 3000), or with a bad usage of BM25 (improper
parameters), where it can happen.

BM25 is not just a different way of doing length normalization; it is
grounded in the probabilistic framework, and parametrises term frequencies
and document length. It is probably the most successful ranking model of
the last years in Information Retrieval.

I have never read a paper where VSM improves on any of the
state-of-the-art ranking models (Language Models, DFR, BM25, ...),
although VSM with pivoted length normalisation can obtain nice results.
This can be verified by checking the last years of the TREC competition.

Honestly, to say that it is a myth that BM25 improves on VSM contradicts
the last 10 or 15 years of research on Information Retrieval, and I really
believe that is not accurate.

The good thing about Information Retrieval is that you can always make
your own experiments, and you can draw on the experience of many years of
research.

PS: This opinion is based on experiments on TREC and CLEF collections;
obviously we can start a debate about the suitability of this type of
experimentation (the concept of relevance, pooling, relevance judgements),
but that is a much more complex topic and I believe it is far from what we
are dealing with here.

PS2: Regarding TREC-4, Cornell used pivoted length normalisation and
applied pseudo-relevance feedback, which honestly makes the analysis of
the results much more difficult. Obviously their results were part of the
pool.

Sorry for the huge mail :-))))
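Since pivoted length normalisation comes up above, a small self-contained sketch may help contrast the two length-normalization schemes discussed in this thread: Lucene's default lengthNorm (1/sqrt(numTerms), as in DefaultSimilarity) and the pivoted scheme ((1 - slope) + slope * docLen/avgDocLen). The slope value here is illustrative, not a recommended setting:

```java
// Two length-normalization schemes mentioned in this thread. Lucene's
// DefaultSimilarity multiplies scores by 1/sqrt(numTerms); the pivoted
// scheme divides scores by (1 - slope) + slope * (docLen / avgDocLen),
// so a document of exactly average length is neither rewarded nor penalized.
public class LengthNormDemo {
    static double luceneLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    static double pivotedNorm(double docLen, double avgDocLen, double slope) {
        return (1.0 - slope) + slope * (docLen / avgDocLen);
    }

    public static void main(String[] args) {
        double avg = 798.30; // avg doc length quoted in this thread
        // Lucene's norm keeps shrinking as documents get longer:
        System.out.println(luceneLengthNorm(353));  // AP-like short doc
        System.out.println(luceneLengthNorm(3543)); // PAT-like long doc
        // The pivoted norm equals 1.0 exactly at the average length (the "pivot"):
        System.out.println(pivotedNorm(avg, avg, 0.25)); // prints 1.0
    }
}
```

This also illustrates Robert's point: both schemes behave very differently when a single "average document length" mixes sub-collections with averages of 353 and 3543.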

> Hi Ivan,
>
> the problem is that unfortunately BM25
> cannot be implemented overwriting
> the Similarity interface. Therefore BM25Similarity
> only computes the classic probabilistic IDF (what is
> interesting only at search time).
> If you set BM25Similarity at indexing time
> some basic stats are not stored
> correctly in the segments (like docs length).
>
> When you use BM25BooleanQuery this class
> will set automatically the BM25Similarity for you,
> therefore you don't need to do this explicitly.
>
> I tried to make this implementation with the focus on
> not interfering on the typical use of Lucene (so no changing
> DefaultSimilarity).
>
>> Joaquin, Robert,
>>
>> I followed Joaquin's recommendation and removed the call to set
>> similarity
>> to BM25 explicitly (indexer, searcher).  The results showed 55%
>> improvement for the MAP score (0.141->0.219) over default similarity.
>>
>> Joaquin, how would setting the similarity to BM25 explicitly make the
>> score worse?
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>>
>>> From: Robert Muir <rc...@gmail.com>
>>> Subject: Re: BM25 Scoring Patch
>>> To: java-user@lucene.apache.org
>>> Date: Tuesday, February 16, 2010, 11:36 AM
>>> yes Ivan, if possible please report
>>> back any findings you can on the
>>> experiments you are doing!
>>>
>>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>>> <
>>> joaquin.perez@lsi.uned.es>
>>> wrote:
>>>
>>> > Hi Ivan,
>>> >
>>> > You shouldn't set the BM25Similarity for indexing or
>>> searching.
>>> > Please try removing the lines:
>>> >   writer.setSimilarity(new
>>> BM25Similarity());
>>> >   searcher.setSimilarity(sim);
>>> >
>>> > Please let us/me know if you improve your results with
>>> these changes.
>>> >
>>> >
>>> > Robert Muir wrote:
>>> >
>>> >  Hi Ivan, I've seen many cases where BM25
>>> performs worse than Lucene's
>>> >> default Similarity. Perhaps this is just another
>>> one?
>>> >>
>>> >> Again while I have not worked with this particular
>>> collection, I looked at
>>> >> the statistics and noted that its composed of
>>> several 'sub-collections':
>>> >> for
>>> >> example the PAT documents on disk 3 have an
>>> average doc length of 3543,
>>> >> but
>>> >> the AP documents on disk 1 have an avg doc length
>>> of 353.
>>> >>
>>> >> I have found on other collections that any
>>> advantages of BM25's document
>>> >> length normalization fall apart when 'average
>>> document length' doesn't
>>> >> make
>>> >> a whole lot of sense (cases like this).
>>> >>
>>> >> For this same reason, I've only found a few
>>> collections where BM25's doc
>>> >> length normalization is really significantly
>>> better than Lucene's.
>>> >>
>>> >> In my opinion, the results on a particular test
>>> collection or 2 have
>>> >> perhaps
>>> >> been taken too far and created a myth that BM25 is
>>> always superior to
>>> >> Lucene's scoring... this is not true!
>>> >>
>>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>>> <ip...@yahoo.com>
>>> >> wrote:
>>> >>
>>> >>  I applied the Lucene patch mentioned in
>>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
>>> ran the MAP
>>> >>> numbers
>>> >>> on TREC-3 collection using topics
>>> 151-200.  I am now getting worse
>>> >>> results
>>> >>> comparing to Lucene DefaultSimilarity.  I
>>> suspect, I am not using it
>>> >>> correctly.  I have single field
>>> documents.  This is the process I use:
>>> >>>
>>> >>> 1. During the indexing, I am setting the
>>> similarity to BM25 as such:
>>> >>>
>>> >>> IndexWriter writer = new IndexWriter(dir, new
>>> StandardAnalyzer(
>>> >>>
>>>    Version.LUCENE_CURRENT), true,
>>> >>>
>>>    IndexWriter.MaxFieldLength.UNLIMITED);
>>> >>> writer.setSimilarity(new BM25Similarity());
>>> >>>
>>> >>> 2. During the Precision/Recall measurements, I
>>> am using a
>>> >>> SimpleBM25QQParser extension I added to the
>>> benchmark:
>>> >>>
>>> >>> QualityQueryParser qqParser = new
>>> SimpleBM25QQParser("title", "TEXT");
>>> >>>
>>> >>>
>>> >>> 3. Here is the parser code (I set an avg doc
>>> length here):
>>> >>>
>>> >>> public Query parse(QualityQuery qq) throws
>>> ParseException {
>>> >>>   BM25Parameters.setAverageLength(indexField,
>>> 798.30f);//avg doc length
>>> >>>   BM25Parameters.setB(0.5f);//tried
>>> default values
>>> >>>   BM25Parameters.setK1(2f);
>>> >>>   return query = new
>>> BM25BooleanQuery(qq.getValue(qqName), indexField,
>>> >>> new
>>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
>>> >>> }
>>> >>>
>>> >>> 4. The searcher is using BM25 similarity:
>>> >>>
>>> >>> Searcher searcher = new IndexSearcher(dir,
>>> true);
>>> >>> searcher.setSimilarity(sim);
>>> >>>
>>> >>> Am I missing some steps?  Does anyone
>>> have experience with this code?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Ivan
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> > --
>>> >
>>> -----------------------------------------------------------
>>> > Joaquín Pérez Iglesias
>>> > Dpto. Lenguajes y Sistemas Informáticos
>>> > E.T.S.I. Informática (UNED)
>>> > Ciudad Universitaria
>>> > C/ Juan del Rosal nº 16
>>> > 28040 Madrid - Spain
>>> > Phone. +34 91 398 89 19
>>> > Fax    +34 91 398 65 35
>>> > Office  2.11
>>> > Email: joaquin.perez@lsi.uned.es
>>> > web:   http://nlp.uned.es/~jperezi/
>>> >
>>> -----------------------------------------------------------
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>





Re: BM25 Scoring Patch

Posted by JOAQUIN PEREZ IGLESIAS <jo...@lsi.uned.es>.
Hi Ivan,

the problem is that, unfortunately, BM25
cannot be implemented by overriding
the Similarity interface. BM25Similarity therefore
only computes the classic probabilistic IDF (which
matters only at search time).
If you set BM25Similarity at indexing time,
some basic stats (such as document lengths)
are not stored correctly in the segments.

When you use BM25BooleanQuery, this class
will set BM25Similarity for you automatically,
so you don't need to do it explicitly.

I made this implementation with the goal of not
interfering with the typical use of Lucene (so no
changes to DefaultSimilarity).
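
Putting this advice together with Ivan's original steps, a minimal sketch of the corrected flow (a sketch only, assuming the LUCENE-2091 patch classes mentioned in this thread, BM25BooleanQuery and BM25Parameters, plus Ivan's field name and parameter values; queryText is a placeholder):

```java
// Index with Lucene's default similarity so document length stats are
// written correctly: note there is no writer.setSimilarity(...) call.
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
        Version.LUCENE_CURRENT), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
// ... add documents, writer.close() ...

// At search time, configure BM25 and build a BM25BooleanQuery; the
// query installs BM25Similarity itself, so the searcher needs no
// setSimilarity(...) call either.
BM25Parameters.setAverageLength("TEXT", 798.30f); // avg doc length
BM25Parameters.setB(0.5f);
BM25Parameters.setK1(2f);
Query query = new BM25BooleanQuery(queryText, "TEXT",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
Searcher searcher = new IndexSearcher(dir, true);
TopDocs hits = searcher.search(query, 1000);
```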

> Joaquin, Robert,
>
> I followed Joaquin's recommendation and removed the call to set similarity
> to BM25 explicitly (indexer, searcher).  The results showed 55%
> improvement for the MAP score (0.141->0.219) over default similarity.
>
> Joaquin, how would setting the similarity to BM25 explicitly make the
> score worse?
>
> Thank you,
>
> Ivan
>
>
>
> --- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:
>
>> From: Robert Muir <rc...@gmail.com>
>> Subject: Re: BM25 Scoring Patch
>> To: java-user@lucene.apache.org
>> Date: Tuesday, February 16, 2010, 11:36 AM
>> yes Ivan, if possible please report
>> back any findings you can on the
>> experiments you are doing!
>>
>> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
>> <
>> joaquin.perez@lsi.uned.es>
>> wrote:
>>
>> > Hi Ivan,
>> >
>> > You shouldn't set the BM25Similarity for indexing or
>> searching.
>> > Please try removing the lines:
>> >   writer.setSimilarity(new
>> BM25Similarity());
>> >   searcher.setSimilarity(sim);
>> >
>> > Please let us/me know if you improve your results with
>> these changes.
>> >
>> >
>> > Robert Muir escribió:
>> >
>> >  Hi Ivan, I've seen many cases where BM25
>> performs worse than Lucene's
>> >> default Similarity. Perhaps this is just another
>> one?
>> >>
>> >> Again while I have not worked with this particular
>> collection, I looked at
>>> >> the statistics and noted that it's composed of
>> several 'sub-collections':
>> >> for
>> >> example the PAT documents on disk 3 have an
>> average doc length of 3543,
>> >> but
>> >> the AP documents on disk 1 have an avg doc length
>> of 353.
>> >>
>> >> I have found on other collections that any
>> advantages of BM25's document
>> >> length normalization fall apart when 'average
>> document length' doesn't
>> >> make
>> >> a whole lot of sense (cases like this).
>> >>
>> >> For this same reason, I've only found a few
>> collections where BM25's doc
>> >> length normalization is really significantly
>> better than Lucene's.
>> >>
>> >> In my opinion, the results on a particular test
>> collection or 2 have
>> >> perhaps
>> >> been taken too far and created a myth that BM25 is
>> always superior to
>> >> Lucene's scoring... this is not true!
>> >>
>> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
>> <ip...@yahoo.com>
>> >> wrote:
>> >>
>> >>  I applied the Lucene patch mentioned in
>> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
>> ran the MAP
>> >>> numbers
>> >>> on TREC-3 collection using topics
>> 151-200.  I am not getting worse
>> >>> results
>> >>> comparing to Lucene DefaultSimilarity.  I
>> suspect, I am not using it
>> >>> correctly.  I have single field
>> documents.  This is the process I use:
>> >>>
>> >>> 1. During the indexing, I am setting the
>> similarity to BM25 as such:
>> >>>
>> >>> IndexWriter writer = new IndexWriter(dir, new
>> StandardAnalyzer(
>> >>>
>>    Version.LUCENE_CURRENT), true,
>> >>>
>>    IndexWriter.MaxFieldLength.UNLIMITED);
>> >>> writer.setSimilarity(new BM25Similarity());
>> >>>
>> >>> 2. During the Precision/Recall measurements, I
>> am using a
>> >>> SimpleBM25QQParser extension I added to the
>> benchmark:
>> >>>
>> >>> QualityQueryParser qqParser = new
>> SimpleBM25QQParser("title", "TEXT");
>> >>>
>> >>>
>> >>> 3. Here is the parser code (I set an avg doc
>> length here):
>> >>>
>> >>> public Query parse(QualityQuery qq) throws
>> ParseException {
>> >>>   BM25Parameters.setAverageLength(indexField,
>> 798.30f);//avg doc length
>> >>>   BM25Parameters.setB(0.5f);//tried
>> default values
>> >>>   BM25Parameters.setK1(2f);
>> >>>   return query = new
>> BM25BooleanQuery(qq.getValue(qqName), indexField,
>> >>> new
>> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
>> >>> }
>> >>>
>> >>> 4. The searcher is using BM25 similarity:
>> >>>
>> >>> Searcher searcher = new IndexSearcher(dir,
>> true);
>> >>> searcher.setSimilarity(sim);
>> >>>
>> >>> Am I missing some steps?  Does anyone
>> have experience with this code?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Ivan
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> > --
>> >
>> -----------------------------------------------------------
>> > Joaquín Pérez Iglesias
>> > Dpto. Lenguajes y Sistemas Informáticos
>> > E.T.S.I. Informática (UNED)
>> > Ciudad Universitaria
>> > C/ Juan del Rosal nº 16
>> > 28040 Madrid - Spain
>> > Phone. +34 91 398 89 19
>> > Fax    +34 91 398 65 35
>> > Office  2.11
>> > Email: joaquin.perez@lsi.uned.es
>> > web:   http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>
>> >
>> -----------------------------------------------------------
>> >
>> >
>> >
>> >
>> >
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>
>
>
>
>





Re: BM25 Scoring Patch

Posted by Ivan Provalov <ip...@yahoo.com>.
Joaquin, Robert,

I followed Joaquin's recommendation and removed the explicit calls that set the similarity to BM25 (on the indexer and the searcher).  The results showed a 55% improvement in the MAP score (0.141 -> 0.219) over the default similarity.
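
As a quick check on the arithmetic, the 55% figure follows directly from the two MAP values quoted above (nothing here touches Lucene):

```java
public class MapImprovement {
    public static void main(String[] args) {
        double defaultMap = 0.141; // MAP with DefaultSimilarity
        double bm25Map = 0.219;    // MAP with BM25 via BM25BooleanQuery
        double relative = (bm25Map - defaultMap) / defaultMap;
        // relative improvement as a percentage, prints 55.3
        System.out.printf("%.1f%n", relative * 100);
    }
}
```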

Joaquin, how would setting the similarity to BM25 explicitly make the score worse?

Thank you,

Ivan



--- On Tue, 2/16/10, Robert Muir <rc...@gmail.com> wrote:

> From: Robert Muir <rc...@gmail.com>
> Subject: Re: BM25 Scoring Patch
> To: java-user@lucene.apache.org
> Date: Tuesday, February 16, 2010, 11:36 AM
> yes Ivan, if possible please report
> back any findings you can on the
> experiments you are doing!
> 
> On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias
> <
> joaquin.perez@lsi.uned.es>
> wrote:
> 
> > Hi Ivan,
> >
> > You shouldn't set the BM25Similarity for indexing or
> searching.
> > Please try removing the lines:
> >   writer.setSimilarity(new
> BM25Similarity());
> >   searcher.setSimilarity(sim);
> >
> > Please let us/me know if you improve your results with
> these changes.
> >
> >
> > Robert Muir escribió:
> >
> >  Hi Ivan, I've seen many cases where BM25
> performs worse than Lucene's
> >> default Similarity. Perhaps this is just another
> one?
> >>
> >> Again while I have not worked with this particular
> collection, I looked at
> >> the statistics and noted that it's composed of
> several 'sub-collections':
> >> for
> >> example the PAT documents on disk 3 have an
> average doc length of 3543,
> >> but
> >> the AP documents on disk 1 have an avg doc length
> of 353.
> >>
> >> I have found on other collections that any
> advantages of BM25's document
> >> length normalization fall apart when 'average
> document length' doesn't
> >> make
> >> a whole lot of sense (cases like this).
> >>
> >> For this same reason, I've only found a few
> collections where BM25's doc
> >> length normalization is really significantly
> better than Lucene's.
> >>
> >> In my opinion, the results on a particular test
> collection or 2 have
> >> perhaps
> >> been taken too far and created a myth that BM25 is
> always superior to
> >> Lucene's scoring... this is not true!
> >>
> >> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov
> <ip...@yahoo.com>
> >> wrote:
> >>
> >>  I applied the Lucene patch mentioned in
> >>> https://issues.apache.org/jira/browse/LUCENE-2091 and
> ran the MAP
> >>> numbers
> >>> on TREC-3 collection using topics
> 151-200.  I am not getting better
> >>> results
> >>> compared to Lucene DefaultSimilarity.  I
> suspect I am not using it
> >>> correctly.  I have single field
> documents.  This is the process I use:
> >>>
> >>> 1. During the indexing, I am setting the
> similarity to BM25 as such:
> >>>
> >>> IndexWriter writer = new IndexWriter(dir, new
> StandardAnalyzer(
> >>>           
>    Version.LUCENE_CURRENT), true,
> >>>           
>    IndexWriter.MaxFieldLength.UNLIMITED);
> >>> writer.setSimilarity(new BM25Similarity());
> >>>
> >>> 2. During the Precision/Recall measurements, I
> am using a
> >>> SimpleBM25QQParser extension I added to the
> benchmark:
> >>>
> >>> QualityQueryParser qqParser = new
> SimpleBM25QQParser("title", "TEXT");
> >>>
> >>>
> >>> 3. Here is the parser code (I set an avg doc
> length here):
> >>>
> >>> public Query parse(QualityQuery qq) throws
> ParseException {
> >>>   BM25Parameters.setAverageLength(indexField,
> 798.30f);//avg doc length
> >>>   BM25Parameters.setB(0.5f);//tried
> default values
> >>>   BM25Parameters.setK1(2f);
> >>>   return query = new
> BM25BooleanQuery(qq.getValue(qqName), indexField,
> >>> new
> >>> StandardAnalyzer(Version.LUCENE_CURRENT));
> >>> }
> >>>
> >>> 4. The searcher is using BM25 similarity:
> >>>
> >>> Searcher searcher = new IndexSearcher(dir,
> true);
> >>> searcher.setSimilarity(sim);
> >>>
> >>> Am I missing some steps?  Does anyone
> have experience with this code?
> >>>
> >>> Thanks,
> >>>
> >>> Ivan
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> > --
> >
> -----------------------------------------------------------
> > Joaquín Pérez Iglesias
> > Dpto. Lenguajes y Sistemas Informáticos
> > E.T.S.I. Informática (UNED)
> > Ciudad Universitaria
> > C/ Juan del Rosal nº 16
> > 28040 Madrid - Spain
> > Phone. +34 91 398 89 19
> > Fax    +34 91 398 65 35
> > Office  2.11
> > Email: joaquin.perez@lsi.uned.es
> > web:   http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>
> >
> -----------------------------------------------------------
> >
> >
> >
> >
> >
> 
> 
> -- 
> Robert Muir
> rcmuir@gmail.com
> 


      



Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
yes Ivan, if possible please report back any findings you can on the
experiments you are doing!

On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias <
joaquin.perez@lsi.uned.es> wrote:

> Hi Ivan,
>
> You shouldn't set the BM25Similarity for indexing or searching.
> Please try removing the lines:
>   writer.setSimilarity(new BM25Similarity());
>   searcher.setSimilarity(sim);
>
> Please let us/me know if you improve your results with these changes.
>
>
> Robert Muir escribió:
>
>  Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
>> default Similarity. Perhaps this is just another one?
>>
>> Again while I have not worked with this particular collection, I looked at
>> the statistics and noted that it's composed of several 'sub-collections':
>> for
>> example the PAT documents on disk 3 have an average doc length of 3543,
>> but
>> the AP documents on disk 1 have an avg doc length of 353.
>>
>> I have found on other collections that any advantages of BM25's document
>> length normalization fall apart when 'average document length' doesn't
>> make
>> a whole lot of sense (cases like this).
>>
>> For this same reason, I've only found a few collections where BM25's doc
>> length normalization is really significantly better than Lucene's.
>>
>> In my opinion, the results on a particular test collection or 2 have
>> perhaps
>> been taken too far and created a myth that BM25 is always superior to
>> Lucene's scoring... this is not true!
>>
>> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <ip...@yahoo.com>
>> wrote:
>>
>>  I applied the Lucene patch mentioned in
>>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP
>>> numbers
>>> on TREC-3 collection using topics 151-200.  I am not getting better
>>> results
>>> compared to Lucene DefaultSimilarity.  I suspect I am not using it
>>> correctly.  I have single field documents.  This is the process I use:
>>>
>>> 1. During the indexing, I am setting the similarity to BM25 as such:
>>>
>>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>>>               Version.LUCENE_CURRENT), true,
>>>               IndexWriter.MaxFieldLength.UNLIMITED);
>>> writer.setSimilarity(new BM25Similarity());
>>>
>>> 2. During the Precision/Recall measurements, I am using a
>>> SimpleBM25QQParser extension I added to the benchmark:
>>>
>>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>>>
>>>
>>> 3. Here is the parser code (I set an avg doc length here):
>>>
>>> public Query parse(QualityQuery qq) throws ParseException {
>>>   BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
>>>   BM25Parameters.setB(0.5f);//tried default values
>>>   BM25Parameters.setK1(2f);
>>>   return query = new BM25BooleanQuery(qq.getValue(qqName), indexField,
>>> new
>>> StandardAnalyzer(Version.LUCENE_CURRENT));
>>> }
>>>
>>> 4. The searcher is using BM25 similarity:
>>>
>>> Searcher searcher = new IndexSearcher(dir, true);
>>> searcher.setSimilarity(sim);
>>>
>>> Am I missing some steps?  Does anyone have experience with this code?
>>>
>>> Thanks,
>>>
>>> Ivan
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
> --
> -----------------------------------------------------------
> Joaquín Pérez Iglesias
> Dpto. Lenguajes y Sistemas Informáticos
> E.T.S.I. Informática (UNED)
> Ciudad Universitaria
> C/ Juan del Rosal nº 16
> 28040 Madrid - Spain
> Phone. +34 91 398 89 19
> Fax    +34 91 398 65 35
> Office  2.11
> Email: joaquin.perez@lsi.uned.es
> web:   http://nlp.uned.es/~jperezi/ <http://nlp.uned.es/%7Ejperezi/>
> -----------------------------------------------------------
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: BM25 Scoring Patch

Posted by Joaquin Perez Iglesias <jo...@lsi.uned.es>.
Hi Ivan,

You shouldn't set the BM25Similarity for indexing or searching.
Please try removing the lines:
    writer.setSimilarity(new BM25Similarity());
    searcher.setSimilarity(sim);

Please let us/me know if you improve your results with these changes.


Robert Muir escribió:
> Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
> default Similarity. Perhaps this is just another one?
> 
> Again while I have not worked with this particular collection, I looked at
> the statistics and noted that it's composed of several 'sub-collections': for
> example the PAT documents on disk 3 have an average doc length of 3543, but
> the AP documents on disk 1 have an avg doc length of 353.
> 
> I have found on other collections that any advantages of BM25's document
> length normalization fall apart when 'average document length' doesn't make
> a whole lot of sense (cases like this).
> 
> For this same reason, I've only found a few collections where BM25's doc
> length normalization is really significantly better than Lucene's.
> 
> In my opinion, the results on a particular test collection or 2 have perhaps
> been taken too far and created a myth that BM25 is always superior to
> Lucene's scoring... this is not true!
> 
> On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <ip...@yahoo.com> wrote:
> 
>> I applied the Lucene patch mentioned in
>> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
>> on TREC-3 collection using topics 151-200.  I am not getting better results
>> compared to Lucene DefaultSimilarity.  I suspect I am not using it
>> correctly.  I have single field documents.  This is the process I use:
>>
>> 1. During the indexing, I am setting the similarity to BM25 as such:
>>
>> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>>                Version.LUCENE_CURRENT), true,
>>                IndexWriter.MaxFieldLength.UNLIMITED);
>> writer.setSimilarity(new BM25Similarity());
>>
>> 2. During the Precision/Recall measurements, I am using a
>> SimpleBM25QQParser extension I added to the benchmark:
>>
>> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>>
>>
>> 3. Here is the parser code (I set an avg doc length here):
>>
>> public Query parse(QualityQuery qq) throws ParseException {
>>    BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
>>    BM25Parameters.setB(0.5f);//tried default values
>>    BM25Parameters.setK1(2f);
>>    return query = new BM25BooleanQuery(qq.getValue(qqName), indexField, new
>> StandardAnalyzer(Version.LUCENE_CURRENT));
>> }
>>
>> 4. The searcher is using BM25 similarity:
>>
>> Searcher searcher = new IndexSearcher(dir, true);
>> searcher.setSimilarity(sim);
>>
>> Am I missing some steps?  Does anyone have experience with this code?
>>
>> Thanks,
>>
>> Ivan
>>
>>
>>
>>
>>
>>
> 
> 

-- 
-----------------------------------------------------------
Joaquín Pérez Iglesias
Dpto. Lenguajes y Sistemas Informáticos
E.T.S.I. Informática (UNED)
Ciudad Universitaria
C/ Juan del Rosal nº 16
28040 Madrid - Spain
Phone. +34 91 398 89 19
Fax    +34 91 398 65 35
Office  2.11
Email: joaquin.perez@lsi.uned.es
web:   http://nlp.uned.es/~jperezi/
-----------------------------------------------------------



Re: BM25 Scoring Patch

Posted by Robert Muir <rc...@gmail.com>.
Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's
default Similarity. Perhaps this is just another one?

Again, while I have not worked with this particular collection, I looked at
the statistics and noted that it's composed of several 'sub-collections': for
example the PAT documents on disk 3 have an average doc length of 3543, but
the AP documents on disk 1 have an avg doc length of 353.

I have found on other collections that any advantages of BM25's document
length normalization fall apart when 'average document length' doesn't make
a whole lot of sense (cases like this).

For this same reason, I've only found a few collections where BM25's doc
length normalization is really significantly better than Lucene's.

In my opinion, the results on a particular test collection or 2 have perhaps
been taken too far and created a myth that BM25 is always superior to
Lucene's scoring... this is not true!
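
The length-normalization point can be made concrete with the textbook BM25 tf component (the standard formula, not necessarily the patch's exact code). A document of PAT-like length 3543 scores very differently depending on whether the average length plugged in comes from its own sub-collection or from the AP-like average of 353:

```java
public class Bm25LengthNorm {
    // BM25 tf component: tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
    static double tfNorm(double tf, double docLen, double avgDocLen,
                         double k1, double b) {
        return tf * (k1 + 1)
                / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }

    public static void main(String[] args) {
        double k1 = 2.0, b = 0.5; // the values Ivan tried
        // Same term frequency, same 3543-token document, two choices of avgdl:
        double ownAvg = tfNorm(3, 3543, 3543, k1, b); // PAT avgdl
        double apAvg  = tfNorm(3, 3543, 353, k1, b);  // AP avgdl
        System.out.println(ownAvg); // 1.8
        System.out.println(apAvg);  // about 0.641: heavily penalized
    }
}
```

With a mixed-collection average, every long PAT document is treated as grossly over-length, which is one way the normalization "falls apart" as described above.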

On Tue, Feb 16, 2010 at 9:46 AM, Ivan Provalov <ip...@yahoo.com> wrote:

> I applied the Lucene patch mentioned in
> https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers
> on TREC-3 collection using topics 151-200.  I am not getting better results
> compared to Lucene DefaultSimilarity.  I suspect I am not using it
> correctly.  I have single field documents.  This is the process I use:
>
> 1. During the indexing, I am setting the similarity to BM25 as such:
>
> IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
>                Version.LUCENE_CURRENT), true,
>                IndexWriter.MaxFieldLength.UNLIMITED);
> writer.setSimilarity(new BM25Similarity());
>
> 2. During the Precision/Recall measurements, I am using a
> SimpleBM25QQParser extension I added to the benchmark:
>
> QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
>
>
> 3. Here is the parser code (I set an avg doc length here):
>
> public Query parse(QualityQuery qq) throws ParseException {
>    BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
>    BM25Parameters.setB(0.5f);//tried default values
>    BM25Parameters.setK1(2f);
>    return query = new BM25BooleanQuery(qq.getValue(qqName), indexField, new
> StandardAnalyzer(Version.LUCENE_CURRENT));
> }
>
> 4. The searcher is using BM25 similarity:
>
> Searcher searcher = new IndexSearcher(dir, true);
> searcher.setSimilarity(sim);
>
> Am I missing some steps?  Does anyone have experience with this code?
>
> Thanks,
>
> Ivan
>
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com