Posted to solr-user@lucene.apache.org by Scott Lundgren <sc...@carbonblack.com> on 2013/10/23 16:32:53 UTC

Reclaiming disk space from (large, optimized) segments

*Background:*

- Our use case is to use SOLR as a massive FIFO queue.

- Document additions and updates happen continuously.

    - Documents are being added at a sustained rate of 50-100 documents
per second.

    - About 50% of these documents are updates to existing docs, indexed
using atomic updates: the original doc is thus deleted and re-added.

- There is a separate purge operation running every four hours that deletes
the oldest docs, if required, based on a number of unrelated configuration
parameters.

- At some time in the past, a manual force merge / optimize with
maxSegments=2 was run to troubleshoot high disk I/O and remove "too many
segments" as a potential variable.  Currently, the largest fdt files are
74G and 43G.  There are 47 total segments; the next-largest files are all
around 2G.

- Merge policies are all at Solr 4 defaults. Index size is currently ~50M
maxDocs, ~35M numDocs, 276GB.

*Issue:*

The background purge operation is deleting docs on schedule, but the disk
space is not being recovered.

*Presumptions:*
I presume, but have not confirmed (how?), that the 15M deleted documents are
predominantly in the two large segments.  Because they are largely in the
two large segments, and those large segments still have (some/many) live
documents, the segment backing files are not deleted.

*Questions:*

- When will those segments get merged and the space recovered?  Does it
happen when _all_ the documents in those segments are deleted?  When some
percentage of the segment is filled with deleted documents?
- Is there a way to do it right now vs. just waiting?
- In some cases, the purge delete conditional is _just_ free disk space:
 when index > free space, delete oldest.  Those setups are now in scenarios
where index >> free space, and getting worse.  How does low disk space
affect the above two questions?
- Is there a way for me to determine stats on a per-segment basis?
   - for example, how many deleted documents in a particular segment?
- On the flip side, can I determine in what segment a particular document
is located?
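(One avenue I have not yet tried for the per-segment question: Lucene ships
a CheckIndex tool that prints per-segment doc and deletion counts.  The jar
and index paths below are guesses for a typical install; run without -fix
so it stays read-only, and ideally against a snapshot:)

```shell
# Sketch only: jar and index locations vary by install.
# Without -fix, CheckIndex is read-only; it prints one block per
# segment, including the doc count and the number of deletions.
java -cp /path/to/lucene-core-4.2.1.jar \
  org.apache.lucene.index.CheckIndex /var/solr/data/index
```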

Thank you,

Scott

-- 
Scott Lundgren
Director of Engineering
Carbon Black, Inc.
(210) 204-0483 | scott.lundgren@carbonblack.com

Re: Reclaiming disk space from (large, optimized) segments

Posted by Jason Hellman <jh...@innoventsolutions.com>.
If I gauge Otis’ intent here, it is to create shards on the basis of intervals of time.  A shard represents a single interval (let’s say a year’s worth of data), and when that data is no longer necessary it is simply shut down and no longer included in queries.

So, for example, you could have three shards spanning the years 2011, 2012, and 2013 respectively.  When you no longer need 2011 you simply remove the shard.  My example is simple; compress the intervals based upon your needs.
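A rough sketch, with made-up host and core names, assuming one core per
year: queries list only the intervals you still care about via the shards
parameter, so retiring 2011 is just a matter of leaving it out (and
unloading its core):

```shell
# Distributed query across only the still-relevant yearly cores;
# hosts and core names here are hypothetical.
curl "http://localhost:8983/solr/year2013/select?q=*:*&shards=localhost:8983/solr/year2012,localhost:8983/solr/year2013"
```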

On Oct 29, 2013, at 8:42 AM, Gun Akkor <gu...@carbonblack.com> wrote:

> Otis,
> 
> Thank you for your response,
> 
> Could you elaborate a bit more on what you have in mind when you say
> "time-based" indices?
> 
> Gun
> 
> 
> ---
> Senior Software Engineer
> Carbon Black, Inc.
> gun.akkor@carbonblack.com
> 
> 


Re: Reclaiming disk space from (large, optimized) segments

Posted by Gun Akkor <gu...@carbonblack.com>.
Otis,

Thank you for your response,

Could you elaborate a bit more on what you have in mind when you say
"time-based" indices?

Gun


---
Senior Software Engineer
Carbon Black, Inc.
gun.akkor@carbonblack.com


On Thu, Oct 24, 2013 at 11:56 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Only skimmed your email, but purge every 4 hours jumped out at me. Would it
> make sense to have time-based indices that can be periodically dropped
> instead of being purged?
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/

Re: Reclaiming disk space from (large, optimized) segments

Posted by Otis Gospodnetic <ot...@gmail.com>.
Only skimmed your email, but purge every 4 hours jumped out at me. Would it
make sense to have time-based indices that can be periodically dropped
instead of being purged?
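Sketch (core name hypothetical): dropping a whole time window then becomes
a single CoreAdmin call that deletes the index files and frees the space
immediately, no merging involved:

```shell
# Unload the core for an expired window and delete its index files.
curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=events_2013_10&deleteIndex=true"
```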

Otis
Solr & ElasticSearch Support
http://sematext.com/

Re: Reclaiming disk space from (large, optimized) segments

Posted by Gun Akkor <gu...@carbonblack.com>.
Hello Chris,

Thank you for the response, I am following up on the e-mail chain for Scott.

I guess we can try using a commit with expungeDeletes=true, but that does
not really address the underlying problem.

If we hadn't issued the "optimize" in the past, thereby creating the 2 big
segments, my understanding is that Solr would have had (many more) smaller
segments, with deleted docs distributed across them. And in all likelihood,
the behind-the-scenes execution of the tiered merge policy would have
cleaned out the deleted docs as segments merged, reclaiming the space.

But now that we have the two big segments, is there a way for Solr to
reclaim this space as part of its merge operation, or do we have to
manually (either via optimize or expunge deletes) remove the deleted docs
until we eat up all the docs in those big segments (i.e. as they are purged
with our purge logic)?

We are running Solr 4.2.1 with TieredMergePolicy maxMergeAtOnce=10 and
segmentsPerTier=10.
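Related: our reading of TieredMergePolicy is that it also has a
reclaimDeletesWeight setting that biases merge selection toward segments
with many deleted docs.  We have not tried it ourselves; in solrconfig.xml
it would presumably look something like this (value illustrative):

```xml
<!-- Sketch, not our running config: reclaimDeletesWeight above 2.0
     (the default) makes merges favor segments with many deletions. -->
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <double name="reclaimDeletesWeight">3.0</double>
</mergePolicy>
```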

Thanks,

Gun


---
Senior Software Engineer
Carbon Black, Inc.
gun.akkor@carbonblack.com




Re: Reclaiming disk space from (large, optimized) segments

Posted by Chris Hostetter <ho...@fucit.org>.
I didn't dig into the details of your mail too much, but a few things 
jumped out at me...

: - At some time in the past, a manual force merge / optimize with
: maxSegments=2 was run to troubleshoot high disk i/o and remove "too many

Have you tried a simple commit using expungeDeletes=true?  It should be a
little less intensive than optimizing.  (Under the covers it does
IndexWriter.forceMergeDeletes().)
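e.g., something like this against your update handler (URL assumes a
default single-core setup; adjust for yours):

```shell
# Commit that also merges away segments' deleted docs
# (IndexWriter.forceMergeDeletes under the covers).
curl "http://localhost:8983/solr/update?commit=true&expungeDeletes=true"
```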


: - Merge policies are all at Solr 4 defaults. Index size is currently ~50M
: maxDocs, ~35M numDocs, 276GB.

"Solr 4 defaults" is way too vague to be meaningful: 4.0? 4.1? ... 4.4?

Do you mean you are using the example configs that came with that version
of Solr, or do you mean you have no mergePolicy configured and you are
getting the hardcoded defaults?  Either way, it's important to specify
exactly which version of Solr you are running and exactly what your
entire <indexConfig/> section looks like, since both the example configs
and the hardcoded default behavior when configs aren't specified have
evolved since 4.0-ALPHA.



-Hoss