Posted to solr-user@lucene.apache.org by vicky desai <vi...@germinait.com> on 2015/03/16 16:11:38 UTC

Solr Deleted Docs Issue

Hi,

I am having an issue with my Solr setup. In my Solr config I have set the
following property:
*<mergeFactor>10</mergeFactor>*

Now consider the following situation. I have *200* documents in my index and
I need to update all 200 of them.
If I issue *20* commits in total, i.e. I update in batches of 10 docs,
merging is done after every 10th update, so the maximum segment count I can
have is 10, which is fine. However, even when merging happens, deleted docs
are not cleared, and I end up with 100 deleted docs in the index.

If this operation is done continuously, I would end up with a large set of
deleted docs, which will affect the performance of the queries I run against
this Solr instance.

Can anyone please tell me whether I have missed a config setting or whether
this is expected behaviour?



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Deleted-Docs-Issue-tp4193292.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Deleted Docs Issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/19/2015 12:24 AM, vicky desai wrote:
> I fail to understand why this deleted docs are not removed from index on
> merging. Is there a good documentation which explains how exactly is merging
> done?
>
> What can I do to solve this problem other than optimization?

Deleted docs *are* removed by automatic merging -- but only from the
specific segments that are merged, and only docs deleted before the
merge starts.  Deleted docs residing in other index segments are unaffected.
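
A toy model of that behaviour, as a minimal sketch only: the Segment class and
the merge() helper below are hypothetical illustrations, not Lucene or Solr
APIs. Merging copies live docs into a new segment and drops the deleted ones,
while segments outside the merge keep their deleted docs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    live: int      # documents still visible to searches
    deleted: int   # documents marked deleted but still on disk


def merge(segments):
    """Combine segments into one; only live docs are carried over."""
    return Segment(live=sum(s.live for s in segments), deleted=0)


def total_deleted(index):
    """Count deleted docs across every segment in the index."""
    return sum(s.deleted for s in index)


# Three segments holding 100 deleted docs between them (made-up numbers).
index_before = [Segment(90, 10), Segment(80, 20), Segment(30, 70)]

# Only the first two segments are picked for this merge.
index_after = [merge(index_before[:2])] + index_before[2:]

print(total_deleted(index_before))  # deleted docs before the merge
print(total_deleted(index_after))   # deleted docs left in the untouched segment
```

In this model the merge reclaims the 30 deleted docs in the two merged
segments, and the 70 in the third segment stay until that segment is merged.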

If you are replacing/updating/deleting documents in your index on a
regular basis, then there will always be deleted documents in the index,
unless you optimize.  As long as you don't do it frequently, there is
nothing wrong with optimizing your index, you just need to be aware of
the cost -- optimizing causes a large amount of I/O, which can affect
Solr performance while the optimize is happening and for a short time
afterwards.
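
For reference, an optimize can be requested over HTTP through the update
handler. The following is a sketch under assumptions: the base URL and the
core name "collection1" are placeholders for your own installation, and the
helper names are invented for illustration. Nothing is sent until
run_optimize() is called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base_url, core, max_segments=1):
    """Build the update-handler URL that asks Solr to optimize a core."""
    params = urlencode({
        "optimize": "true",
        "maxSegments": max_segments,  # merge down to at most this many segments
        "waitSearcher": "true",       # return only once the new searcher is open
    })
    return f"{base_url}/{core}/update?{params}"


def run_optimize(base_url, core, max_segments=1, timeout=3600):
    """Send the optimize request; this can run for a long time on a big index."""
    with urlopen(optimize_url(base_url, core, max_segments), timeout=timeout) as resp:
        return resp.status


# Assumed host and core name; adjust to your setup.
print(optimize_url("http://localhost:8983/solr", "collection1"))
```

Given the I/O cost described above, a call like this is probably best
scheduled for a quiet period rather than run after every update batch.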

What actual problem are you trying to solve by getting rid of your
deleted documents?  With 2-3 million total docs and about half a million
deleted docs, as long as you have enough memory in the system for
effective disk caching, I don't think performance will be a major
factor.  If you are finding that it does cause much lower performance,
you probably need more RAM in the server.

http://wiki.apache.org/solr/SolrPerformanceProblems
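
To put a number on the share of deleted docs, the arithmetic can be sketched
as below. It assumes the numDocs and maxDoc values that the Solr admin UI
reports for a core, where maxDoc includes deleted docs. The figures are only
an assumed reading of the numbers in this thread: 2.5 million live docs plus
0.5 million deleted.

```python
def deleted_stats(num_docs, max_doc):
    """Return (deleted doc count, deleted fraction of the whole index)."""
    deleted = max_doc - num_docs
    fraction = deleted / max_doc if max_doc else 0.0
    return deleted, fraction


# Assumed figures: 2.5 million live docs, 3.0 million docs including deletes.
deleted, fraction = deleted_stats(num_docs=2_500_000, max_doc=3_000_000)
print(deleted, f"{fraction:.1%}")
```

Under those assumptions about one doc in six is a deleted doc.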

The only other thing that deleted documents might do to your search
results is affect the order of documents returned when you do not
explicitly sort them and rely on relevancy ranking, because the terms in
the deleted documents will affect the similarity calculation.

The most accessible information we have on how merging happens is the
visualization blog post that Erick already shared with you.  The third
video shows how the default merge policy works in recent Solr versions,
with a mergeFactor of 10 ... if you count the number of segments, you
will see that there are quite a lot more than 10 segments in the index
at all times.

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Each of the bars in the graph shows deleted documents with a dark gray
color, and you'll notice that it continually changes while the video
plays ... and the index never reaches a state with minimal deleted
documents.

Thanks,
Shawn


Re: Solr Deleted Docs Issue

Posted by vicky desai <vi...@germinait.com>.
Hi,

Thanks Erick and Shawn for the replies.

I just wanted to clarify that the commit size of 10 was only an example; in
production, commits are handled via Solr's auto-commit feature.
Our requirement is to store around 20-30 lakh (2-3 million) docs, of which
around 5-6 lakh (500,000-600,000) get updated daily. What I have observed is
that although the merge factor seems to work, we always end up with around
6 lakh deleted docs in the index each day.
On optimizing, all these deleted docs are removed. We gain on both memory and
query speed after optimization. But as I understand it, this is a short-lived
gain and the situation repeats itself daily.

I fail to understand why these deleted docs are not removed from the index on
merging. Is there good documentation that explains exactly how merging is
done?

What can I do to solve this problem other than optimization?



Re: Solr Deleted Docs Issue

Posted by Erick Erickson <er...@gmail.com>.
bq: If this operation is continuously done I would end up with a large set of
deleted docs which will affect the performance of the queries I hit on this
solr.

No, you won't. They'll be "merged away" as segments are merged in the
background. Mike's blog (linked below) has a great visualization of the
process; the third video down shows the default TieredMergePolicy.

In general, even in the case of replacing all the docs, you'll have 10% of your
corpus be deleted docs. The % of deleted docs in a segment weighs quite
heavily when it comes to the decision of which segment to merge (note that
merging purges the deleted docs).

Also in general, the results of small tests like this simply do not generalize.
i.e. the number of deleted docs in a 200 doc sample size can't be
extrapolated to a reasonable-sized corpus.

Finally, I don't know if this is something temporary, but the implication of
"If total commit operations I hit are 20" is that you're committing after every
batch of docs is sent to Solr. You should not do this; let your autocommit
settings handle it.

Here's Mike's blog:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Best,
Erick

On Mon, Mar 16, 2015 at 8:51 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 3/16/2015 9:11 AM, vicky desai wrote:
>> I am having an issue with my solr setup. In my solr config I have set
>> following property
>> *<mergeFactor>10</mergeFactor>*
>
> The mergeFactor setting is deprecated ... but you are setting it to the
> default value of 10 anyway, so that's not really a big deal.  It's
> possible that mergeFactor will no longer work in 5.0, but I'm not sure
> on that.  You should instead use the settings specific to the merge
> policy, which normally is TieredMergePolicy.
>
> Note that when mergeFactor is 10, you *will* end up with more than 10
> segments in your index.  There are multiple merge tiers, each one can
> have up to 10 segments before it is merged.
>
>> Now consider following situation. I have* 200* documents in my index. I need
>> to update all the 200 docs
>> If total commit operations I hit are* 20* i.e I update batches of 10 docs
>> merging is done after every 10th update and so the max Segment Count I can
>> have is 10 which is fine. However even when merging happens deleted docs are
>> not cleared and I end up with 100 deleted docs in index.
>>
>> If this operation is continuously done I would end up with a large set of
>> deleted docs which will affect the performance of the queries I hit on this
>> solr.
>
> Because there are multiple merge tiers and you cannot easily
> pre-determine which segments will be chosen for a particular merge, the
> merge behavior may not be exactly what you expect.
>
> The only guaranteed way to get rid of your deleted docs is to do an
> optimize operation, which forces a merge of the entire index down to a
> single segment.  This gets rid of all deleted docs in those segments.
> If you index more data while you are doing the optimize, then you may
> end up with additional deleted docs.
>
> Thanks,
> Shawn
>

Re: Solr Deleted Docs Issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/16/2015 9:11 AM, vicky desai wrote:
> I am having an issue with my solr setup. In my solr config I have set
> following property
> *<mergeFactor>10</mergeFactor>*

The mergeFactor setting is deprecated ... but you are setting it to the
default value of 10 anyway, so that's not really a big deal.  It's
possible that mergeFactor will no longer work in 5.0, but I'm not sure
on that.  You should instead use the settings specific to the merge
policy, which normally is TieredMergePolicy.

Note that when mergeFactor is 10, you *will* end up with more than 10
segments in your index.  There are multiple merge tiers, each one can
have up to 10 segments before it is merged.
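
The tier description above can be sketched with a small simulation. This is a
deliberately simplified model, assuming one new segment per flush and a merge
as soon as a tier holds 10 segments. It is not the actual TieredMergePolicy
algorithm, and simulate_flushes() is a made-up helper.

```python
def simulate_flushes(num_flushes, merge_factor=10):
    """Return (segments per tier, peak total segment count) after the flushes."""
    tiers = [0]  # tiers[i] = number of segments currently in tier i
    peak = 0
    for _ in range(num_flushes):
        tiers[0] += 1  # each flush writes one small segment into tier 0
        level = 0
        # A full tier is merged into a single segment one tier up.
        while tiers[level] >= merge_factor:
            tiers[level] -= merge_factor
            if level + 1 == len(tiers):
                tiers.append(0)
            tiers[level + 1] += 1
            level += 1
        peak = max(peak, sum(tiers))
    return tiers, peak


tiers, peak = simulate_flushes(99)
print(tiers, peak)
```

After 99 flushes this model holds 9 segments in each of two tiers, 18 in
total, which is why a merge factor of 10 does not cap the index at 10 segments.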

> Now consider following situation. I have* 200* documents in my index. I need
> to update all the 200 docs
> If total commit operations I hit are* 20* i.e I update batches of 10 docs
> merging is done after every 10th update and so the max Segment Count I can
> have is 10 which is fine. However even when merging happens deleted docs are
> not cleared and I end up with 100 deleted docs in index. 
>
> If this operation is continuously done I would end up with a large set of
> deleted docs which will affect the performance of the queries I hit on this
> solr.

Because there are multiple merge tiers and you cannot easily
pre-determine which segments will be chosen for a particular merge, the
merge behavior may not be exactly what you expect.

The only guaranteed way to get rid of your deleted docs is to do an
optimize operation, which forces a merge of the entire index down to a
single segment.  This gets rid of all deleted docs in those segments. 
If you index more data while you are doing the optimize, then you may
end up with additional deleted docs.

Thanks,
Shawn