Posted to solr-user@lucene.apache.org by tom_s <to...@gmail.com> on 2019/05/18 15:36:53 UTC

minimize disk space requirement.

hey,
I'm aware that the best practice is to have disk space on your Solr servers
be twice the size of the index, but my goal is to minimize this overhead and
let the index occupy more than 50% of the disk. Documents in our index have
a TTL, so documents are deleted every day, which triggers background merges
of segments. Can I change the merge policy to make the overhead of
background merging lower?
Will limiting the number of concurrent merges help (with the maxMergeCount
parameter)? Do you know of other methods that would help?
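
(For reference, TTL-based deletes in Solr are usually wired up with
DocExpirationUpdateProcessorFactory in solrconfig.xml. A minimal sketch of
such a chain; the field names and delete period here are illustrative, not
necessarily what this index actually uses:)

<updateRequestProcessorChain name="ttl-chain" default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <!-- background thread purges expired docs once a day -->
    <int name="autoDeletePeriodSeconds">86400</int>
    <!-- per-document TTL as date math, e.g. +7DAYS -->
    <str name="ttlFieldName">time_to_live_s</str>
    <!-- computed absolute expiration timestamp -->
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>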

info about my server:
I use Solr 6.5.1. I index about 200 docs per hour per shard and hard commit
every 5 minutes. The size of the index in each shard is around 70GB (with
around 15% deleted documents).
I use the following merge policy:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">2</int>
  <int name="segmentsPerTier">4</int>
</mergePolicyFactory>
(the rest of the parameters are left at their defaults)
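
(Note: maxMergeCount is a setting on the merge scheduler, not the merge
policy. If you experiment with it, it sits next to the policy inside
<indexConfig> in solrconfig.xml. A sketch with illustrative values:)

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <!-- max merges that may be pending before incoming indexing threads stall -->
  <int name="maxMergeCount">6</int>
  <!-- merge threads actually running concurrently -->
  <int name="maxThreadCount">1</int>
</mergeScheduler>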

thanks



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: minimize disk space requirement.

Posted by Erick Erickson <er...@gmail.com>.
It Depends (tm).

No, limiting the background threads won’t help much. Here’s the issue:
At time T, the segments file contains the current “snapshot” of the index, i.e. the names of all the segments that have been committed.

At time T+N, another commit happens. Or, consider an optimize, which for 6x defaults to merging down to a single segment. During any merge, _all_ the new segments are written before _any_ old segment is deleted. The very last operation is to rewrite the segments file, but only after all the new segments are flushed.
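
To put rough numbers on that with the 70GB shard from the original post (a
sketch; the real peak depends on which segments participate and how many
deleted docs get purged in the rewrite):

  old segments, still on disk during the rewrite:  ~70GB
  new merged segment(s) being written:             up to ~70GB
  peak transient usage:                            up to ~140GB, i.e. about 2x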

After this point, all the old, no-longer-used segments will be deleted the next time a searcher is opened; the trigger is opening a new searcher, not the merge finishing.
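
(Whether a hard commit opens a searcher is controlled by openSearcher in
solrconfig.xml; a sketch matching the poster's 5-minute interval, values
illustrative. Per the explanation above, with openSearcher=false the cleanup
waits until something else opens a searcher:)

<autoCommit>
  <!-- hard commit every 5 minutes (300,000 ms) -->
  <maxTime>300000</maxTime>
  <!-- flush and fsync segments, but don't open a new searcher on this commit -->
  <openSearcher>false</openSearcher>
</autoCommit>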

To make matters more interesting, suppose new documents are being indexed during the merge. Those go into new segments that aren’t counted in the totals above. Plus you have transaction logs being written, which are usually pretty small but can grow between commits.

I’ve used optimize as the example, but it’s at least theoretically possible that all the current segments get rewritten into a larger segment as part of a normal merge. This is frankly not very likely with large indexes (say > 20GB), but still possible.

Now, all that said, on a disk that’s hosting multiple replicas from multiple shards and/or multiple collections, the likelihood of all this happening at once (barring someone issuing an optimize for all the collections hosted on the machine) is very low. But what you’re risking is an unknown. Lucene/Solr try very hard to prevent bad things from happening in a “disk full” situation, but given the number of possible code paths that could be affected, it can’t be guaranteed to have benign outcomes.

So perhaps you can run forever with, say, 25% of the aggregate index size free. Or perhaps you’ll blow up unexpectedly; there’s really no way to say ahead of time.

Best,
Erick

> On May 18, 2019, at 8:36 AM, tom_s <to...@gmail.com> wrote:


Re: minimize disk space requirement.

Posted by Erick Erickson <er...@gmail.com>.
Oh, and none of that includes people adding more and more documents to the existing replicas….

> On May 18, 2019, at 10:22 AM, Shawn Heisey <ap...@elyograg.org> wrote:


Re: minimize disk space requirement.

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/18/2019 9:36 AM, tom_s wrote:

Actually the recommendation is to have enough space for the index to 
triple, not just double.  This can happen in the wild.

There are no merge settings that can prevent situations where the index 
doubles in size temporarily due to merging.  Chances are that it's going 
to happen eventually to any index.
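
One illustrative way to get to 3x, combining this with Erick's point that
old segments are only cleaned up when a new searcher opens (a sketch, not
an exact bound):

  generation A: merged-away segments awaiting cleanup   ~1x
  generation B: the current committed segments          ~1x
  generation C: a new merge being written               up to ~1x
  worst-case transient total                            ~3x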

Thanks,
Shawn