Posted to java-user@lucene.apache.org by Koji Sekiguchi <ko...@r.email.ne.jp> on 2007/04/11 19:07:34 UTC

strange idf in Lucene 2.1

Hello,

I have the following three documents in my index:

- Java programming is required to write Lucene application.
- Java is a popular computer language. I like Java.
- Perl is not a kind of jewelry. It is a programming language.

With Lucene 2.0, if I search for "java" and print the explanation, the output is:

1 0.53033006 Java is a popular computer language. I like Java.
0.53033006 = fieldWeight(text:java in 1), product of:
  1.4142135 = tf(termFreq(text:java)=2)
  1.0 = idf(docFreq=2)
  0.375 = fieldNorm(field=text, doc=1)

0 0.375 Java programming is required to write Lucene application.
0.375 = fieldWeight(text:java in 0), product of:
  1.0 = tf(termFreq(text:java)=1)
  1.0 = idf(docFreq=2)
  0.375 = fieldNorm(field=text, doc=0)

But when I use Lucene 2.1, the output is:

4 0.62702066 Java is a popular computer language. I like Java.
0.62702066 = (MATCH) fieldWeight(text:java in 4), product of:
  1.4142135 = tf(termFreq(text:java)=2)
  1.1823215 = idf(docFreq=4)
  0.375 = fieldNorm(field=text, doc=4)

3 0.44337058 Java programming is required to write Lucene application.
0.44337058 = (MATCH) fieldWeight(text:java in 3), product of:
  1.0 = tf(termFreq(text:java)=1)
  1.1823215 = idf(docFreq=4)
  0.375 = fieldNorm(field=text, doc=3)

I don't understand why the idf is not 1.0 (and docFreq is not 2)
when I use Lucene 2.1.

The program is attached at the bottom of this mail.
In the program, I deliberately add these three documents to the index,
delete all of them, and then add them again.
If I optimize the index (uncomment writer.optimize() in the program),
the idf becomes 1.0 with Lucene 2.1.
Is this a feature?

Thank you,

Koji

---

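import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Written against the Lucene 2.0/2.1 API.
public class Test1 {

    private static String[] contents = {
        "Java programming is required to write Lucene application.",
        "Java is a popular computer language. I like Java.",
        "Perl is not a kind of jewelry. It is a programming language."
    };
    private static String F = "text";
    private static String QUERY = "java";
    private static Analyzer analyzer = new StandardAnalyzer();
    private static Directory dir = new RAMDirectory();

    public static void main(String[] args) throws IOException {
        makeIndex( true );   // create the index and add the three documents
        deleteAll();         // mark every document as deleted
        makeIndex( false );  // append the same three documents again
        searchIndex();
    }

    // Adds the three documents; create=true starts a new index.
    private static void makeIndex( boolean create ) throws IOException {
        IndexWriter writer = new IndexWriter( dir, analyzer, create );
        for( String content : contents ){
            Document doc = new Document();
            doc.add( new Field( F, content, Store.YES, Index.TOKENIZED ) );
            writer.addDocument( doc );
        }
        // Uncomment to optimize the index before closing the writer:
        //writer.optimize();
        writer.close();
    }

    // Marks every document currently in the index as deleted.
    private static void deleteAll() throws IOException {
        IndexReader reader = IndexReader.open( dir );
        int max = reader.maxDoc();
        for( int i = 0; i < max; i++ )
            reader.deleteDocument( i );
        reader.close();
    }

    // Runs the term query and prints id, score, text, and explanation per hit.
    private static void searchIndex() throws IOException {
        IndexSearcher searcher = new IndexSearcher( dir );
        Query query = new TermQuery( new Term( F, QUERY ) );
        Hits hits = searcher.search( query );
        for( int i = 0; i < hits.length(); i++ ){
            int id = hits.id( i );
            float score = hits.score( i );
            Document doc = hits.doc( i );
            System.out.println( id + "\t" + score + "\t" + doc.get( F ) );
            Explanation exp = searcher.explain( query, id );
            System.out.println( exp.toString() );
        }
        searcher.close();
    }
}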


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: strange idf in Lucene 2.1

Posted by Bill Janssen <ja...@parc.com>.

> docfreqs (idfs) do not take into account deleted docs.
> This is more of an engineering tradeoff rather than a feature.
> If we could cheaply and easily update idfs when documents are deleted
> from an index, we would.

Wow.  So is it fair to say that the stored IDF is really the
cumulative IDF for all the documents that have ever been in the index
since it was last optimized?

Bill

---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Yonik Seeley <yo...@apache.org>.

On 4/12/07, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:
>  > Is the index completely removed between the 2.0 and 2.1 runs?
>
> Sure. If you see my program, you'll find I'm using RAMDirectory.

OK, I think it's due to the change in merge policy.
Lucene 2.0 could under-merge (not enough) or over-merge (before necessary).
I think you are hitting a case of over-merging in Lucene 2.0, where
the second time you write documents, the two segments are merged into
one, squeezing out the deleted docs.

In Lucene 2.1, you end up with two segments (one with the deleted docs).

-Yonik

---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

 > Is the index completely removed between the 2.0 and 2.1 runs?

Sure. If you see my program, you'll find I'm using RAMDirectory.

regards,

Koji



---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Yonik Seeley <yo...@apache.org>.

On 4/12/07, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:
> Chris,
> > i'm not understanding this part of the thread ... are you saying that if
> > you have two identical setups, the only difference being that one uses 2.0
> > and the other uses 2.1, then you see different idfs after
> > adding/deleting/re-adding many docs?
>
>
> Exactly. Please try to run the program which was attached in my first mail:

Is the index completely removed between the 2.0 and 2.1 runs?

-Yonik

---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Chris,

> i'm not understanding this part of the thread ... are you saying that if
> you have two identical setups, the only difference being that one uses 2.0
> and the other uses 2.1, then you see different idfs after
> adding/deleting/re-adding many docs?


Exactly. Please try to run the program which was attached in my first mail:

http://www.gossamer-threads.com/lists/lucene/java-user/48086

regards,

Koji


---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Chris Hostetter <ho...@fucit.org>.

:  > This should be the same for Lucene 2.0 and 2.1.
:
: I understand. But I think we are more likely to come across this issue
: with Lucene 2.1 than with 2.0?

i'm not understanding this part of the thread ... are you saying that if
you have two identical setups, the only difference being that one uses 2.0
and the other uses 2.1, then you see different idfs after
adding/deleting/re-adding many docs?

if so, i'm at a loss to explain that.



-Hoss


---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Yonik,

Thank you for your explanation.
Incidentally, I learned of this issue from a customer who is using Solr.
To reproduce the issue with Solr, post exampledocs/*.xml twice
and issue a query with q=ipod&debugQuery=on.

 > This should be the same for Lucene 2.0 and 2.1.

I understand. But I think we are more likely to come across this issue
with Lucene 2.1 than with 2.0?

Thank you again,

Koji



---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Yonik Seeley <yo...@apache.org>.

On 4/12/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : But not which terms have an odd IDF value because of those deleted
> : documents.  How much does the IDF value contribute to the "score" in
> : search?
>
> all idfs are affected equally, because the "numDocs" value used is
> always the same

There is that part of it, but it's smaller compared to potential docfreq skew.

If I had a document with "ABC" in it, and it only appeared in that
document, the docfreq is 1.  If I delete that doc, and re-add it, the
docfreq may all of a sudden appear to be 2.

-Yonik
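
To put rough numbers on the two effects discussed here, the sketch below compares them for a hypothetical 1000-document index. It assumes the default idf formula is 1 + ln(numDocs / (docFreq + 1)), which is consistent with the values in the explain() output earlier in the thread; the index size and the class name DocFreqSkew are made up for illustration.

```java
// Sketch only: compares the maxDoc effect with the docFreq skew.
// Assumption: idf = 1 + ln(numDocs / (docFreq + 1)).
final class DocFreqSkew {

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int maxDoc = 1000; // hypothetical index size, no deletions yet

        // "ABC" occurs in exactly one document.
        double before = idf(1, maxDoc);

        // Delete that document and re-add it with no merge in between:
        // the deleted copy still counts, so docFreq is 2 and maxDoc is 1001.
        double readded = idf(2, maxDoc + 1);

        // Another term with docFreq 1 that was not touched only sees
        // the larger maxDoc.
        double untouched = idf(1, maxDoc + 1);

        System.out.println("idf(ABC) before:            " + before);
        System.out.println("idf(ABC) after re-adding:   " + readded);
        System.out.println("idf(untouched term) after:  " + untouched);
    }
}
```

Under these assumptions the shift caused by the larger maxDoc is tiny, while the shift caused by the doubled docFreq is much larger.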

---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Chris Hostetter <ho...@fucit.org>.

: But if now the index goes through a massive update, where almost all the
: docs containing TC are deleted, and TC is not in any newly added doc,
: practically TC becomes rare too, and hence D2 should probably be scored
: higher than D1. But IDF(TC) might not (yet) reflect the massive docs
: deletion, and the scores are wrongly biased so D1 is still scored higher
: than D2.

yeah ... i was only thinking about the numDocs change (which would be the
same for idf(TC) and idf(TR)) and forgot that docFreq is ignorant of
deletes as well.

: I didn't follow the code for that, just thinking IDFs and scoring aloud, so
: I hope I am not missing something, but in any case this is just for the
: sake of discussion, because in reality you don't expect index statistics to
: change that dramatically, ahead of merges.

that's really the key issue to remember ... you might notice this when
deleting/re-adding 90% of the docs in an index consisting of only 10 docs,
because you'll likely still only have one segment -- but if you do the
same thing in an index of 100,000 docs you're going to get some segment
merges which will help keep things balanced.


-Hoss


---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Doron Cohen <DO...@il.ibm.com>.

Chris Hostetter <ho...@fucit.org> wrote on 12/04/2007 15:22:20:

>
> : But not which terms have an odd IDF value because of those deleted
> : documents.  How much does the IDF value contribute to the "score" in
> : search?
>
> all idfs are affected equally, because the "numDocs" value used is
> always the same ... it really shouldn't affect the scores from a query,
> it just makes it hard to compare the scores you get from one index reader
> with the scores you get from a new index reader after deleting and
> readding a bunch of documents.

Not sure about the extreme case - assume two words, one common and one
rare, making up an OR query:
   TC  (a very common term)
   TR  (a very rare term)
   IDF(TC) << IDF(TR)
   Q = TC TR
   D1 = document with one occurrence of TR
   D2 = document with three occurrences of TC.
==> D1 is scored higher than D2

But if now the index goes through a massive update, where almost all the
docs containing TC are deleted, and TC is not in any newly added doc,
practically TC becomes rare too, and hence D2 should probably be scored
higher than D1. But IDF(TC) might not (yet) reflect the massive docs
deletion, and the scores are wrongly biased so D1 is still scored higher
than D2.

I didn't follow the code for that, just thinking IDFs and scoring aloud, so
I hope I am not missing something, but in any case this is just for the
sake of discussion, because in reality you don't expect index statistics to
change that dramatically, ahead of merges.
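
The scenario above can be sketched with made-up numbers. This is a simplification, not Lucene's full scoring: each document is scored as sqrt(tf) * idf for the one term it contains, ignoring norms, coord, and query normalization, and idf is assumed to be 1 + ln(numDocs / (docFreq + 1)). The index size and document frequencies are hypothetical.

```java
// Sketch only: how a stale docFreq for TC can keep D1 ranked above D2.
// Assumptions: idf = 1 + ln(numDocs / (docFreq + 1)); score = sqrt(tf) * idf.
final class StaleIdfRanking {

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Simplified single-term score; not Lucene's complete formula. */
    static double score(int termFreq, int docFreq, int numDocs) {
        return Math.sqrt(termFreq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int maxDoc = 1000;   // hypothetical; deleted docs are still counted
        int dfTR = 2;        // TR is rare
        int staleDfTC = 900; // TC docFreq, still including the deleted docs
        int liveDfTC = 2;    // TC docFreq if the deletions were reflected

        double d1 = score(1, dfTR, maxDoc);           // D1: one occurrence of TR
        double d2Stale = score(3, staleDfTC, maxDoc); // D2: three TC, stale stats
        double d2Live = score(3, liveDfTC, maxDoc);   // D2: three TC, fresh stats

        System.out.println("D1:                  " + d1);
        System.out.println("D2 with stale docFreq: " + d2Stale);
        System.out.println("D2 with live docFreq:  " + d2Live);
        System.out.println("D1 beats D2 (stale):   " + (d1 > d2Stale));
        System.out.println("D2 beats D1 (live):    " + (d2Live > d1));
    }
}
```

maxDoc is held constant here so that only the docFreq difference drives the change in ranking.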


---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Chris Hostetter <ho...@fucit.org>.

: But not which terms have an odd IDF value because of those deleted
: documents.  How much does the IDF value contribute to the "score" in
: search?

all idfs are affected equally, because the "numDocs" value used is
always the same ... it really shouldn't affect the scores from a query,
it just makes it hard to compare the scores you get from one index reader
with the scores you get from a new index reader after deleting and
readding a bunch of documents.




-Hoss


---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Bill Janssen <ja...@parc.com>.

> The difference between IndexReader.maxDoc() and numDocs() tells you
> how many documents have been marked for deletion but still take up
> space in the index.

But not which terms have an odd IDF value because of those deleted
documents.  How much does the IDF value contribute to the "score" in
search?

Bill

---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Yonik Seeley <yo...@apache.org>.

On 4/12/07, Bill Janssen <ja...@parc.com> wrote:
> > docfreqs (idfs) do not take into account deleted docs.
> > This is more of an engineering tradeoff rather than a feature.
> > If we could cheaply and easily update idfs when documents are deleted
> > from an index, we would.
>
> Wow.  So is it fair to say that the stored IDF is really the
> cumulative IDF for all the documents that have ever been in the index
> since it was last optimized?

Not quite... all documents that are marked as deleted, but haven't
actually been removed from the index.  Adding new documents sometimes
causes segments to be merged, and the resulting new segment will have
no deleted docs.

The difference between IndexReader.maxDoc() and numDocs() tells you
how many documents have been marked for deletion but still take up
space in the index.

-Yonik
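
The behaviour described here can be illustrated with a toy model. This is not Lucene's implementation: ToySegment and ToyIndexDemo are hypothetical classes in which a segment is a list of strings plus a deleted-docs bit set, docFreq counts every document regardless of deletion, and a merge copies only live documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Locale;

// Toy model only, NOT Lucene's implementation.
final class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    void add(String text) { docs.add(text); }

    /** Marks every doc as deleted; the docs and their term stats stay in place. */
    void deleteAll() { deleted.set(0, docs.size()); }

    int maxDoc() { return docs.size(); }

    int numDocs() { return docs.size() - deleted.cardinality(); }

    /** Counts every doc containing the term, deleted or not. */
    int docFreq(String term) {
        int n = 0;
        for (String doc : docs) {
            if (Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\W+")).contains(term)) {
                n++;
            }
        }
        return n;
    }

    /** Merging copies only live docs, squeezing out the deleted ones. */
    static ToySegment merge(List<ToySegment> segments) {
        ToySegment merged = new ToySegment();
        for (ToySegment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (!s.deleted.get(i)) {
                    merged.add(s.docs.get(i));
                }
            }
        }
        return merged;
    }
}

final class ToyIndexDemo {
    static int docFreq(List<ToySegment> segs, String term) {
        int n = 0;
        for (ToySegment s : segs) n += s.docFreq(term);
        return n;
    }

    static int maxDoc(List<ToySegment> segs) {
        int n = 0;
        for (ToySegment s : segs) n += s.maxDoc();
        return n;
    }

    static int numDocs(List<ToySegment> segs) {
        int n = 0;
        for (ToySegment s : segs) n += s.numDocs();
        return n;
    }

    public static void main(String[] args) {
        String[] contents = {
            "Java programming is required to write Lucene application.",
            "Java is a popular computer language. I like Java.",
            "Perl is not a kind of jewelry. It is a programming language."
        };
        ToySegment first = new ToySegment();
        ToySegment second = new ToySegment();
        for (String c : contents) first.add(c);
        first.deleteAll();                        // delete everything...
        for (String c : contents) second.add(c);  // ...then add it again

        List<ToySegment> segs = Arrays.asList(first, second);
        // prints: before merge: docFreq=4 maxDoc=6 numDocs=3
        System.out.println("before merge: docFreq=" + docFreq(segs, "java")
                + " maxDoc=" + maxDoc(segs) + " numDocs=" + numDocs(segs));

        ToySegment merged = ToySegment.merge(segs);
        // prints: after merge:  docFreq=2 maxDoc=3 numDocs=3
        System.out.println("after merge:  docFreq=" + merged.docFreq("java")
                + " maxDoc=" + merged.maxDoc() + " numDocs=" + merged.numDocs());
    }
}
```

In this model, maxDoc() minus numDocs() is 3 before the merge and 0 after it, which corresponds to the number of documents marked for deletion but not yet removed.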

---------------------------------------------------------------------

Re: strange idf in Lucene 2.1

Posted by Yonik Seeley <yo...@apache.org>.

On 4/11/07, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:
> In the program, I deliberately add these three documents to the index,
> delete all of them, and then add them again.
> If I optimize the index (uncomment writer.optimize() in the program),
> the idf becomes 1.0 with Lucene 2.1.
> Is this a feature?

docfreqs (idfs) do not take into account deleted docs.
This is more of an engineering tradeoff rather than a feature.
If we could cheaply and easily update idfs when documents are deleted
from an index, we would.

This should be the same for Lucene 2.0 and 2.1.

-Yonik
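
The idf values in the two explain() outputs can be reproduced by hand. The sketch below assumes the default similarity computes idf as 1 + ln(numDocs / (docFreq + 1)) with numDocs taken from maxDoc(), and tf as the square root of the term frequency; the numbers reported in the thread are consistent with that. IdfSketch is a hypothetical standalone class and does not depend on Lucene.

```java
// Sketch only: recomputes the idf values reported in the thread.
// Assumptions: idf = 1 + ln(numDocs / (docFreq + 1)), numDocs = maxDoc(),
// tf = sqrt(termFreq).
final class IdfSketch {

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    static float tf(int termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    public static void main(String[] args) {
        float fieldNorm = 0.375f; // taken from the explain() output

        // No deleted docs left in the index: maxDoc = 3, docFreq("java") = 2.
        float clean = idf(2, 3);

        // Three deleted docs still counted: maxDoc = 6, docFreq("java") = 4.
        float stale = idf(4, 6);

        System.out.println("idf, deleted docs squeezed out: " + clean);
        System.out.println("idf, deleted docs still counted: " + stale);
        System.out.println("fieldWeight for termFreq=2, stale idf: "
                + tf(2) * stale * fieldNorm);
    }
}
```

With maxDoc = 3 and docFreq = 2 the ratio is exactly 1, so the idf is 1.0; with maxDoc = 6 and docFreq = 4 it is 1 + ln(1.2), about 1.18, matching the Lucene 2.1 output.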

---------------------------------------------------------------------