Posted to solr-user@lucene.apache.org by ig01 <in...@elbitsystems.com> on 2014/12/31 10:32:37 UTC

Frequent deletions

Hello,
We perform frequent deletions from our index, which greatly increases the
index size.
How can we perform an optimization in order to reduce the size?
Please advise,
Thanks.




--
View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/13/2015 12:10 AM, ig01 wrote:
> Unfortunately this is the case, we do have hundreds of millions of documents
> on one 
> Solr instance/server. All our configs and schema are with default
> configurations. Our index
> size is 180G, does that mean that we need at least 180G heap size?

If you have hundreds of millions of documents and the index is only
180GB, they must be REALLY tiny documents.

The number of documents has a lot more impact on the heap requirements
than the index size on disk.  As described in my previous email, I have
about 130GB of total index on my dev Solr server, and the heap is only
7GB.  Everything I ask that machine to do, which includes optimizing
shards that are up to 20GB each, works flawlessly.

When a Solr index has 500 million documents, the amount of memory
required to construct a single entry in the filterCache is over 60MB.
The size of the filterCache in the default example config is 512 ...
which means that if that cache ends up fully utilized, that's in the
neighborhood of 30GB of RAM required for just one Solr cache.  The
amount of memory required for the Lucene FieldCache could be insane with
500 million documents, depending on the exact nature of the queries that
you are doing.
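
A rough back-of-the-envelope check, assuming each cached filter is stored
as a bitset with one bit per document in the index (the worst case):

  500,000,000 docs / 8 bits per byte ≈ 62.5 MB per filterCache entry
  62.5 MB x 512 entries ≈ 32 GB for a fully populated filterCache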

The index size on disk has a different tie to memory -- the RAM that is
not allocated to programs is automatically used by the operating system
for caching data on the disk.  If you have plenty of RAM so the OS disk
cache can effectively keep relevant parts of the index in memory,
performance will not suffer.  Anytime Solr must actually ask the disk
for index data, it will be slow.
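
A quick way to see how much RAM the OS is currently using for that disk
cache on Linux is the "cached" figure reported by free (exact layout
depends on your procps version):

  free -g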

With 120GB out of the 140GB total allocated to Solr, that leaves 20GB to
cache 180GB of index data.  That's almost certainly not enough.
Although the OS disk cache requirements have no direct correlation with
OOME exceptions, slow performance due to insufficient caching might lead
*indirectly* to OOME, because the slow performance means that it's more
likely you'll have many queries happening at the same time, which will
lead to larger heap requirements.

Thanks,
Shawn


Re: Frequent deletions

Posted by ig01 <in...@elbitsystems.com>.
Hi,

Unfortunately this is the case, we do have hundreds of millions of documents
on one Solr instance/server. All our configs and schema use the default
configurations. Our index size is 180G; does that mean that we need at least
a 180G heap size?

Thanks.




--
View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4179122.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/10/2015 11:46 PM, ig01 wrote:
> Thank you all for your response,
> The thing is that we have 180G index while half of it are deleted documents.
> We  tried to run an optimization in order to shrink index size but it
> crashes on ‘out of memory’ when the process reaches 120G.   
> Is it possible to optimize parts of the index? 
> Please advise what can we do in this situation.

If you are getting "OutOfMemoryError" exceptions from Java, that means
your heap isn't large enough to accomplish what you have asked the
program to do (between the configuration and what you have actually
requested).  You'll either need to allocate more memory to the heap, or
you need to change your config so less memory is required.

I see from a later reply that the 120GB size you have mentioned is your
Java heap.  Unless you've got hundreds of millions of documents on one
Solr instance/server (which would not be a good idea) and/or a serious
misconfiguration, I cannot imagine needing a heap that big for Solr.

The largest index on my dev Solr server has 98 million documents in
seven shards, with a total index size a little over 120GB (six shards of
20GB each and a seventh shard that's less than 1GB), and my heap size is
7 gigabytes.  There is a smaller index as well with 17 million docs in
three shards; that one is about 10GB on disk.  Unlike the production
servers, the dev server has all the index data contained on one server.

Here's a wiki page that covers things which cause large heap
requirements.  A later section also describes steps you can take to
reduce memory usage.

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
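
How you change the heap depends on how you start Solr; for the stock Jetty
"example" directory in Solr 4.x it is roughly the following, with the sizes
being whatever your own testing shows you need:

  java -Xms7g -Xmx7g -jar start.jar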

How many documents do you have on a single Solr server?  Can you use a
site like http://apaste.info to share your solrconfig.xml?  I don't know
if we'll need the schema, but it might be a good idea to share that as well.

Thanks,
Shawn


Re: Frequent deletions

Posted by ig01 <in...@elbitsystems.com>.
Hi,

We gave 120G to the JVM, while we have 140G of memory on this machine.
We use the default merge policy ("TieredMergePolicy"), and there are 54
segments in our index.
We tried to perform an optimization with different numbers of maxSegments
(53 and less), but it didn't help.
How much memory do we need to optimize a 180G index?
Does every update delete the document and create a new one?
How can a commit with expungeDeletes=true affect performance?
Currently we do not have a performance issue.

Thanks in advance.



--
View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178875.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by David Santamauro <da...@gmail.com>.
[ disclaimer: this worked for me, ymmv ... ]

I just battled this. It turns out that incrementally optimizing using the
maxSegments attribute was the most efficient solution for me, in
particular when you are actually running out of disk space.

#!/bin/bash

# n-segments I started with
high=400
# n-segments I want to optimize down to
low=300

for i in $(seq $high -10 $low); do
  # your optimize call with maxSegments=$i, e.g. (adjust host/core to your own):
  curl "http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=$i"
  sleep 2
done

I was able to shrink my +3TB index by about 300GB optimizing
from 400 segments down to 300 (10 at a time). It optimized out the .del
files for those segments that had one and, best part, because you are only
rewriting 10 segments per loop, the disk space footprint stays tolerable ...
at least compared to a commit with @expungeDeletes=true or, of course, an
optimize without @maxSegments, which basically rewrites the entire index.

NOTE: it wreaks havoc on the system, so expect search slowdowns, and it's
best not to index while this is going on either.

David


On Sun, 2015-01-11 at 06:46 -0700, ig01 wrote:
> Hi,
> 
> It's not an option for us, all the documents in our index have same deletion
> probability.
> Is there any other solution to perform an optimization in order to reduce
> index size?
> 
> Thanks in advance.
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178720.html
> Sent from the Solr - User mailing list archive at Nabble.com.




Re: Frequent deletions

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I believe if you delete all documents in a segment, that segment as a
whole goes away.

A segment is created on every commit, whether you reopen the searcher
or not. Do you know which documents would be deleted later (are there
natural clusters)? If yes, perhaps there is a way to index them so
that most of the deleted documents would end up occupying whole segments
on disk.

A bit of a long shot, not sure if it is useful.

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 11 January 2015 at 08:46, ig01 <in...@elbitsystems.com> wrote:
> Hi,
>
> It's not an option for us, all the documents in our index have same deletion
> probability.
> Is there any other solution to perform an optimization in order to reduce
> index size?
>
> Thanks in advance.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178720.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by ig01 <in...@elbitsystems.com>.
Hi,

It's not an option for us; all the documents in our index have the same
deletion probability.
Is there any other solution for performing an optimization in order to
reduce the index size?

Thanks in advance.



--
View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178720.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.
Maybe you should consider creating different generations of indexes rather
than keeping everything in one index. If the likelihood of documents being
deleted is rather high in, e.g., the first week or so, you could have
one index for the documents with a high deletion probability (the fresh
ones) and a second one for the potentially longer-lived documents.
Without knowing the temporal distribution of deletion probabilities, it
is hard to say what the ideal index topology would be.
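
If you do split things up that way -- purely as a sketch, with hypothetical
core names "fresh" and "archive" on one node -- a single query can still
search both generations via distributed search:

  curl "http://localhost:8983/solr/fresh/select?q=*:*&shards=localhost:8983/solr/fresh,localhost:8983/solr/archive"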

Apart from that, my experience has been that in some cases where
Solr would produce the notorious out-of-memory exceptions, Elasticsearch
seems to be a bit more robust. You may want to give it a try as well.

Best regards,
--Jürgen

On 11.01.2015 07:46, ig01 wrote:
> Thank you all for your response,
> The thing is that we have 180G index while half of it are deleted documents.
> We  tried to run an optimization in order to shrink index size but it
> crashes on ‘out of memory’ when the process reaches 120G.   
> Is it possible to optimize parts of the index? 
> Please advise what can we do in this situation.
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178700.html
> Sent from the Solr - User mailing list archive at Nabble.com.


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com
<ma...@devoteam.com>, URL: www.devoteam.de
<http://www.devoteam.de/>

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071



Re: Frequent deletions

Posted by "Michał B. ." <m....@gmail.com>.
Not directly on your subject, but you could look at this patch:
https://issues.apache.org/jira/browse/SOLR-6841 - it implements visualization
of Solr (Lucene) segments with exact information on how many deletions are
present in each segment. Looking at this one you could - of course, next
time - react a little bit earlier.
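
In the meantime, a quick way to check how many deleted documents and
segments a core currently has is the Luke request handler (hypothetical
host/core names; the "index" section of the response reports numDocs,
maxDoc and a segment count, and maxDoc minus numDocs is the number of
deleted documents):

  curl "http://localhost:8983/solr/collection1/admin/luke?numTerms=0&wt=json"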

2015-01-11 7:46 GMT+01:00 ig01 <in...@elbitsystems.com>:

> Thank you all for your response,
> The thing is that we have 180G index while half of it are deleted
> documents.
> We  tried to run an optimization in order to shrink index size but it
> crashes on ‘out of memory’ when the process reaches 120G.
> Is it possible to optimize parts of the index?
> Please advise what can we do in this situation.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178700.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Michał Bieńkowski

Re: Frequent deletions

Posted by Erick Erickson <er...@gmail.com>.
OK, why can't you give the JVM more memory, perhaps on
a one-time basis to get past this problem? You've never
told us how much memory you give the JVM in the first place.

Best,
Erick

On Sun, Jan 11, 2015 at 7:54 AM, Jack Krupansky
<ja...@gmail.com> wrote:
> Usually, Lucene will be optimizing (merging) segments on the fly so that
> you should only have a fraction of your total deletions present in the
> index and should never have an absolute need to do an old-fashioned full
> optimize.
>
> What merge policy are you using?
>
> Is Solr otherwise running fine other than this optimize operation?
>
>
> -- Jack Krupansky
>
> On Sun, Jan 11, 2015 at 1:46 AM, ig01 <in...@elbitsystems.com> wrote:
>
>> Thank you all for your response,
>> The thing is that we have 180G index while half of it are deleted
>> documents.
>> We  tried to run an optimization in order to shrink index size but it
>> crashes on ‘out of memory’ when the process reaches 120G.
>> Is it possible to optimize parts of the index?
>> Please advise what can we do in this situation.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178700.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>

Re: Frequent deletions

Posted by Jack Krupansky <ja...@gmail.com>.
Usually, Lucene will be optimizing (merging) segments on the fly so that
you should only have a fraction of your total deletions present in the
index and should never have an absolute need to do an old-fashioned full
optimize.

What merge policy are you using?

Is Solr otherwise running fine other than this optimize operation?
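
For reference, the merge policy is configured in the <indexConfig> section
of solrconfig.xml. A sketch only -- the values here are illustrative, and
reclaimDeletesWeight is the knob that biases merging toward segments with
many deleted documents:

<indexConfig>
  <!-- TieredMergePolicy is the default since Solr/Lucene 3.3 -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <double name="reclaimDeletesWeight">3.0</double>
  </mergePolicy>
</indexConfig>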


-- Jack Krupansky

On Sun, Jan 11, 2015 at 1:46 AM, ig01 <in...@elbitsystems.com> wrote:

> Thank you all for your response,
> The thing is that we have 180G index while half of it are deleted
> documents.
> We  tried to run an optimization in order to shrink index size but it
> crashes on ‘out of memory’ when the process reaches 120G.
> Is it possible to optimize parts of the index?
> Please advise what can we do in this situation.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178700.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Frequent deletions

Posted by ig01 <in...@elbitsystems.com>.
Thank you all for your response,
The thing is that we have a 180G index while half of it is deleted documents.
We tried to run an optimization in order to shrink the index size, but it
crashes with ‘out of memory’ when the process reaches 120G.
Is it possible to optimize parts of the index?
Please advise what we can do in this situation.




--
View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178700.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Frequent deletions

Posted by Amey Jadiye <am...@codeinventory.com>.
Well, we are doing the same thing (in a way). We have to do frequent deletions in bulk; at a time we are deleting around 20M+ documents. All I am doing is, after the deletions, firing the below command on each of our Solr nodes and keeping some patience, as it takes quite a lot of time.

curl -vvv "http://node1.solr.xxxxx.com/collection1/update?optimize=true&distrib=false" >> /tmp/__solr_clener_log

After the optimisation finishes, curl returns the XML below:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">10268995</int></lst>
</response>

Regards,
Amey

> Date: Wed, 31 Dec 2014 02:32:37 -0700
> From: inna.geller@elbitsystems.com
> To: solr-user@lucene.apache.org
> Subject: Frequent deletions
> 
> Hello,
> We perform frequent deletions from our index, which greatly increases the
> index size.
> How can we perform an optimization in order to reduce the size.
> Please advise,
> Thanks.
> 
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Is there a specific list of which data structures are "sparse" and
"non-sparse" for Lucene and Solr (referencing the G+ post)? I imagine this
is obvious to low-level hackers, but it could actually be nice to
summarize it somewhere for troubleshooting.

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 1 January 2015 at 05:22, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Also see this G+ post I wrote up recently showing how the percentage of
> deletions changes over time for an "every add also deletes a previous
> document" stress test: https://plus.google.com/112759599082866346694/posts/MJVueTznYnD
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Dec 31, 2014 at 12:21 PM, Erick Erickson
> <er...@gmail.com> wrote:
>> It's usually not necessary to optimize, as more indexing happens you
>> should see background merges happen that'll reclaim the space, so I
>> wouldn't worry about it unless you're seeing actual problems that have
>> to be addressed. Here's a great visualization of the process:
>>
>> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>>
>> See especially the third video, "TieredMergePolicy" which is the default.
>>
>> If you insist, however, try a commit with expungeDeletes=true
>>
>> and if that isn't enough, try an optimize call
>> you can issue a "force merge" (aka optimize)  command from the URL (Or
>> cUrl etc) as:
>> http://localhost:8983/solr/techproducts/update?optimize=true
>>
>> But please don't do this unless it's absolutely necessary. You state
>> that you have "frequent deletions", but eventually this should all
>> happen in the background. Optimize is a fairly expensive operation and
>> should be used judiciously.
>>
>> Best,
>> Erick
>>
>> On Wed, Dec 31, 2014 at 1:32 AM, ig01 <in...@elbitsystems.com> wrote:
>>> Hello,
>>> We perform frequent deletions from our index, which greatly increases the
>>> index size.
>>> How can we perform an optimization in order to reduce the size.
>>> Please advise,
>>> Thanks.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by Michael McCandless <lu...@mikemccandless.com>.
Also see this G+ post I wrote up recently showing how the percentage of
deletions changes over time for an "every add also deletes a previous
document" stress test: https://plus.google.com/112759599082866346694/posts/MJVueTznYnD

Mike McCandless

http://blog.mikemccandless.com


On Wed, Dec 31, 2014 at 12:21 PM, Erick Erickson
<er...@gmail.com> wrote:
> It's usually not necessary to optimize, as more indexing happens you
> should see background merges happen that'll reclaim the space, so I
> wouldn't worry about it unless you're seeing actual problems that have
> to be addressed. Here's a great visualization of the process:
>
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
> See especially the third video, "TieredMergePolicy" which is the default.
>
> If you insist, however, try a commit with expungeDeletes=true
>
> and if that isn't enough, try an optimize call
> you can issue a "force merge" (aka optimize)  command from the URL (Or
> cUrl etc) as:
> http://localhost:8983/solr/techproducts/update?optimize=true
>
> But please don't do this unless it's absolutely necessary. You state
> that you have "frequent deletions", but eventually this should all
> happen in the background. Optimize is a fairly expensive operation and
> should be used judiciously.
>
> Best,
> Erick
>
> On Wed, Dec 31, 2014 at 1:32 AM, ig01 <in...@elbitsystems.com> wrote:
>> Hello,
>> We perform frequent deletions from our index, which greatly increases the
>> index size.
>> How can we perform an optimization in order to reduce the size.
>> Please advise,
>> Thanks.
>>
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Frequent deletions

Posted by Erick Erickson <er...@gmail.com>.
It's usually not necessary to optimize; as more indexing happens you
should see background merges that reclaim the space, so I wouldn't
worry about it unless you're seeing actual problems that have to be
addressed. Here's a great visualization of the process:

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

See especially the third video, "TieredMergePolicy" which is the default.

If you insist, however, try a commit with expungeDeletes=true

and if that isn't enough, you can issue a "force merge" (aka optimize)
command from the URL (or curl etc.) as:
http://localhost:8983/solr/techproducts/update?optimize=true

But please don't do this unless it's absolutely necessary. You state
that you have "frequent deletions", but eventually this should all
happen in the background. Optimize is a fairly expensive operation and
should be used judiciously.
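
For reference, both of the above can be issued with curl along these lines
(the host and core name are just the ones from the example URL; adjust to
your own, and maxSegments is optional):

  # commit that also merges away deleted docs where it is cheap to do so
  curl "http://localhost:8983/solr/techproducts/update?commit=true&expungeDeletes=true"
  # forced merge (optimize), optionally capped at a target segment count
  curl "http://localhost:8983/solr/techproducts/update?optimize=true&maxSegments=10"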

Best,
Erick

On Wed, Dec 31, 2014 at 1:32 AM, ig01 <in...@elbitsystems.com> wrote:
> Hello,
> We perform frequent deletions from our index, which greatly increases the
> index size.
> How can we perform an optimization in order to reduce the size.
> Please advise,
> Thanks.
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
> Sent from the Solr - User mailing list archive at Nabble.com.