Posted to java-user@lucene.apache.org by Yuval Feinstein <yu...@answers.com> on 2010/02/09 15:26:37 UTC

Different replicas return different scores

We are running a large sharded Lucene-based application.
Our configuration supports near real-time updates by incrementally
updating documents (delete, then add) on the shards.
Every shard is replicated to several machines in order to improve performance.
We replicate a shard by sending the same deletion and addition commands to all of its replicas,
where they may be performed in a different order. (We delete a set of documents, say 1000 at a time,
then add them one by one, semi-asynchronously.)
Lately we have noticed a subtle difference in query scores across different replicas of the same shard.
Further investigation showed that the only noticeable difference between the replicas was the index directory structure:
1.      Different replicas have different sets of segments - most segment files are the same, but some are different.
2.      The number of deleted documents differs between two replicas of the same shard.
Is this known behavior of Java Lucene?
How can we change this behavior? We want different replicas to return exactly the same score for each query hit.
(We would rather not optimize the index, as we believe this will harm performance.)

TIA,
Yuval and Ophir



RE: Do deleted documents affect scores?

Posted by Yuval Feinstein <yu...@answers.com>.
Thanks Ian and Andrzej.
You solved a mystery for us.
-- Yuval

________________________________________
From: Andrzej Bialecki [ab@getopt.org]
Sent: Thursday, February 11, 2010 6:53 PM
To: java-user@lucene.apache.org
Subject: Re: Do deleted documents affect scores?

On 2010-02-11 17:35, Ian Lea wrote:
> I'm pretty sure that the answer is no and a quick test on a small
> index with/without deleted docs showed no difference in the scores,
> using 3.0.  But that was hardly a rigorous test and I don't know
> enough about lucene internals and scoring to give a definitive answer.
>
> Shouldn't be too hard for you to verify or disprove: build an index
> and throw loads of updates and deletes at it, checking scores as you
> go.

Actually, deleted docs do affect scoring for a time - IDF of a term is
not updated until you optimize (or when Lucene decides to merge segments).


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Do deleted documents affect scores?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-02-11 17:35, Ian Lea wrote:
> I'm pretty sure that the answer is no and a quick test on a small
> index with/without deleted docs showed no difference in the scores,
> using 3.0.  But that was hardly a rigorous test and I don't know
> enough about lucene internals and scoring to give a definitive answer.
>
> Shouldn't be too hard for you to verify or disprove: build an index
> and throw loads of updates and deletes at it, checking scores as you
> go.

Actually, deleted docs do affect scoring for a time - IDF of a term is 
not updated until you optimize (or when Lucene decides to merge segments).


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
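
Editorial illustration of the point above (not part of the original message). This is a minimal sketch, assuming the classic DefaultSimilarity formula idf = 1 + ln(numDocs / (docFreq + 1)) and assuming that both inputs still count deleted documents until their segments are merged. The class name and all document counts are hypothetical, chosen only to show how two replicas holding the same live documents can compute different IDF values.

```java
public class IdfWithDeletes {

    // Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
    // Assumption: numDocs and docFreq include deleted-but-unmerged documents.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical replica X: all deletions already merged away.
        int maxDocX = 1000;   // documents counted by the index
        int docFreqX = 100;   // documents containing the query term

        // Hypothetical replica Y: same live documents, but 50 deleted
        // documents are still present in its segments, 20 of which
        // contain the query term.
        int maxDocY = 1000 + 50;
        int docFreqY = 100 + 20;

        System.out.println("replica X idf = " + idf(docFreqX, maxDocX));
        System.out.println("replica Y idf = " + idf(docFreqY, maxDocY));

        // Once replica Y merges the affected segments, its counts drop back
        // to those of replica X and the two IDF values agree again.
        System.out.println("replica Y after merge idf = " + idf(100, 1000));
    }
}
```

Under these assumptions the two replicas disagree only while their sets of unmerged deletions differ, which matches the symptom of replicas with different segment structures returning slightly different scores.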




Re: Do deleted documents affect scores?

Posted by Ian Lea <ia...@gmail.com>.
I'm pretty sure that the answer is no and a quick test on a small
index with/without deleted docs showed no difference in the scores,
using 3.0.  But that was hardly a rigorous test and I don't know
enough about lucene internals and scoring to give a definitive answer.

Shouldn't be too hard for you to verify or disprove: build an index
and throw loads of updates and deletes at it, checking scores as you
go.


--
Ian.


On Thu, Feb 11, 2010 at 7:34 AM, Yuval Feinstein <yu...@answers.com> wrote:
> I want to focus my previous question.
> Say we have two Lucene indexes: A and B.
> Index A contains documents a and b.
> Index B used to contain documents a, b and c,
> But c was deleted.
> All documents share some vocabulary.
> If we search using terms common to documents b and c,
> Can we get a different score for document b in index A and index B?
> Note that both indexes are identical with regard to the non-deleted documents,
> And only differ by the deleted document c.
> Thanks,
> Yuval



Do deleted documents affect scores?

Posted by Yuval Feinstein <yu...@answers.com>.
I want to focus my previous question.
Say we have two Lucene indexes, A and B.
Index A contains documents a and b.
Index B used to contain documents a, b and c,
but c was deleted.
All documents share some vocabulary.
If we search using terms common to documents b and c,
can we get a different score for document b in index A and index B?
Note that both indexes are identical with regard to the non-deleted documents,
and differ only by the deleted document c.
Thanks,
Yuval




RE: Different replicas return different scores

Posted by Yuval Feinstein <yu...@answers.com>.
Thanks for these directions, Ian.
We are running Lucene 2.9.1 on CentOS 5 64-bit machines.
We do use the compound file format, and will look into replacing it with the non-compound (multi-file) format,
although I believe this will create too many files.
We will also consider the rsync option.
Thanks again,
-- Yuval

-----Original Message-----
From: Ian Lea [mailto:ian.lea@gmail.com] 
Sent: Tuesday, February 09, 2010 6:13 PM
To: java-user@lucene.apache.org
Subject: Re: Different replicas return different scores

Since the update commands may run in different order on different
shards you might get different sets of segments because merges happen
to be triggered at different points in the different batches of
updates.  But you shouldn't have different numbers of deleted docs if
you have really been applying the same updates to all the shards.
Could some updates have been missed?  Or docs added then deleted or
something?  Maybe there are other variations between the shards and
that is causing the variation in query scores.

As an alternative approach you could have one master index per shard
that takes all the updates and then send that index out to the shard
servers.  If you don't use compound file format, and don't optimize,
the file changes are typically quite small with default or sensible
merge settings and can be distributed quickly using rsync.  You can
have more control by using MergePolicy and friends.

What version of lucene are you running?


--
Ian.


On Tue, Feb 9, 2010 at 2:26 PM, Yuval Feinstein <yu...@answers.com> wrote:
> We are running a large sharded Lucene-based application.
> Our configuration supports near real-time updates, by incrementally
> Updating documents (using delete then add) on the shards.
> Every shard is replicated to several machines in order to improve performance.
> We replicate the shard by sending the same deletion and addition commands to all the replicas,
> Where they may be performed in a different order. (We delete a set of documents, say 1000 at a time,
> Then add them one-by-one semi-asynchronously).
> Lately we have noticed a subtle difference in query scores across different replicas of the same shard.
> Further investigation showed that the only noticeable difference between the replicas was the index directory structure:
> 1.      Different replicas have different sets of segments - most segment files are the same, but some are different.
> 2.      The numbers of deleted documents are different between two replicas of the same shard.
> Is this a known behavior of Java Lucene?
> How can we change this behavior? We want different replicas returning the exact same score per query hits.
> (We would rather not optimize the index as we believe this will harm performance.)
>
> TIA,
> Yuval and Ophir
>
>
>



Re: Different replicas return different scores

Posted by Ian Lea <ia...@gmail.com>.
Since the update commands may run in different order on different
shards you might get different sets of segments because merges happen
to be triggered at different points in the different batches of
updates.  But you shouldn't have different numbers of deleted docs if
you have really been applying the same updates to all the shards.
Could some updates have been missed?  Or docs added then deleted or
something?  Maybe there are other variations between the shards and
that is causing the variation in query scores.

As an alternative approach you could have one master index per shard
that takes all the updates and then send that index out to the shard
servers.  If you don't use compound file format, and don't optimize,
the file changes are typically quite small with default or sensible
merge settings and can be distributed quickly using rsync.  You can
have more control by using MergePolicy and friends.

What version of lucene are you running?


--
Ian.


On Tue, Feb 9, 2010 at 2:26 PM, Yuval Feinstein <yu...@answers.com> wrote:
> We are running a large sharded Lucene-based application.
> Our configuration supports near real-time updates, by incrementally
> Updating documents (using delete then add) on the shards.
> Every shard is replicated to several machines in order to improve performance.
> We replicate the shard by sending the same deletion and addition commands to all the replicas,
> Where they may be performed in a different order. (We delete a set of documents, say 1000 at a time,
> Then add them one-by-one semi-asynchronously).
> Lately we have noticed a subtle difference in query scores across different replicas of the same shard.
> Further investigation showed that the only noticeable difference between the replicas was the index directory structure:
> 1.      Different replicas have different sets of segments - most segment files are the same, but some are different.
> 2.      The numbers of deleted documents are different between two replicas of the same shard.
> Is this a known behavior of Java Lucene?
> How can we change this behavior? We want different replicas returning the exact same score per query hits.
> (We would rather not optimize the index as we believe this will harm performance.)
>
> TIA,
> Yuval and Ophir
>
>
>
