Posted to solr-user@lucene.apache.org by tushar kapoor <tu...@rediffmail.com> on 2009/07/09 09:26:51 UTC

Re: Create incremental snapshot

What I gather from this discussion is -

1. Snapshots are always hard links, not actual files, so they cannot
possibly consume the same amount of space (see the hard-link sketch below).
2. Snapshots contain hard links to existing docs + delta docs.
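
(A minimal hard-link sketch, plain Python with made-up file names, of the
behaviour point 1 assumes -- the data lives on disk once no matter how many
names point at it, and removing the original name frees nothing while
another link remains:)

    import os

    os.makedirs("snapshot.20090709120000", exist_ok=True)

    # ~1 MB standing in for an index segment file
    with open("segment_0.dat", "wb") as f:
        f.write(b"x" * (1 << 20))

    # a hard link is just a second directory entry for the same data blocks
    os.link("segment_0.dat", "snapshot.20090709120000/segment_0.dat")
    print(os.stat("segment_0.dat").st_nlink)   # 2 -- two names, one copy on disk

    # deleting the original file (as Solr does after a merge) frees nothing
    # as long as the snapshot still links to it
    os.remove("segment_0.dat")
    print(os.path.getsize("snapshot.20090709120000/segment_0.dat"))  # still ~1 MB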

We are facing a situation wherein the snapshot occupies the same space as
the actual indexes, thus violating the first point.
We have a batch processing scheme for refreshing indexes. The steps we
follow are (a sketch of the equivalent update calls appears after the list) -

1. Delete 200 documents in one go.
2. Do an optimize.
3. Create the 200 documents deleted earlier.
4. Do a commit.
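
(Roughly, as a sketch only, the four steps as raw XML update calls -- the
endpoint URL and the "batch"/"id" fields here are placeholders, not our real
schema or client code:)

    import urllib.request

    SOLR_UPDATE = "http://localhost:8983/solr/update"   # placeholder URL

    def post(xml):
        """POST one raw XML update message to Solr."""
        req = urllib.request.Request(
            SOLR_UPDATE,
            data=xml.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        return urllib.request.urlopen(req).read()

    # 1. delete 200 documents in one go ("batch" is a made-up field)
    post("<delete><query>batch:42</query></delete>")

    # 2. optimize -- rewrites every segment file in the index
    post("<optimize/>")

    # 3. re-add the same 200 documents (field names are placeholders)
    docs = "".join(
        '<doc><field name="id">doc-%d</field>'
        '<field name="batch">42</field></doc>' % i
        for i in range(200))
    post("<add>%s</add>" % docs)

    # 4. commit -- fires the postCommit snapshooter if one is configured
    post("<commit/>")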

This process continues for around 160,000 documents, i.e. 800 times, and by
the end of it we have 800 snapshots.

The size of the actual index is 200 MB and, remarkably, all 800 snapshots
are around 200 MB each. In effect this process consumes around 160
GB of space on our disks. This is causing a lot of pain right now.

My concerns are: Is our understanding of the snapshooter correct? Should
this massive space consumption be happening at all? Are we missing
something critical?

Regards,
Tushar.

Shalin Shekhar Mangar wrote:
> 
> On Sat, Apr 18, 2009 at 1:06 PM, Koushik Mitra
> <Ko...@infosys.com>wrote:
> 
>> Ok....
>>
>> If these are hard links, then where does the index data get stored? Those
>> must be getting stored somewhere in the file system.
>>
> 
> Yes, of course they are stored on disk. The hard links are created from
> the
> actual files inside the index directory. When those older files are
> deleted
> by Solr, they are still left on the disk if at least one hard link to that
> file exists. If you are looking for how to clean old snapshots, you could
> use the snapcleaner script.
> 
> Is that what you wanted to do?
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 

-- 
View this message in context: http://www.nabble.com/Create-incremental-snapshot-tp23109877p24405434.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Create incremental snapshot

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks for the reply Asif. We have already tried removing the optimization
: step. Unfortunately the commit command alone is also causing an identical
: behaviour. Is there anything else that we are missing?

The hardlinking behavior of snapshots is based on the files in the index
directory, and the files in the index directory are based on the current
segments of your index -- so if you make enough changes to your index to
cause all of the segments to change, every snapshot will be different.

Optimizing guarantees that every segment will be different (because all
the old segments are gone, and a new segment is created), but if your merge
settings are set to be really aggressive, then it's equally possible that
some number of delete/add calls will also cause every segment to be
replaced.

Without your configs, and directory listings of subsequent snapshots, it's
hard to guess what the problem might be (if you have already stopped
optimizing on every batch).
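
(If you want to check this on your own box, here is a rough sketch --
assuming the usual snapshot.* directories under the data dir -- that counts
how many files two consecutive snapshots actually share; shared files have
the same inode, i.e. the same bytes on disk:)

    import os

    def shared_files(snap_a, snap_b):
        """How many files of snap_b are hard links into snap_a (same inode)?"""
        inodes_a = {os.stat(os.path.join(snap_a, f)).st_ino
                    for f in os.listdir(snap_a)}
        names_b = os.listdir(snap_b)
        shared = sum(1 for f in names_b
                     if os.stat(os.path.join(snap_b, f)).st_ino in inodes_a)
        return shared, len(names_b)

    # e.g.:
    # shared, total = shared_files("data/snapshot.20090709120000",
    #                              "data/snapshot.20090709121500")
    # shared == 0 means every segment was rewritten between the two snapshots.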

But I think we have an XY problem here...

: >> This process continues for around 160,000 documents, i.e. 800 times, and by
: >> the end of it we have 800 snapshots.

Why do you keep 800 snapshots?

You really only need snapshots around long enough to ensure that a slave
isn't in the middle of snappulling one when you delete it ... unless you
have some really funky use case where you want some of your query boxes to
deliberately fetch old versions of the index, you don't really need more
than a couple of snapshots at one time.

It can be prudent to keep more snapshots than you "need" around in case
of logical index corruption (i.e. someone foolishly deletes a bunch of
docs they shouldn't have), because snapshots are *usually* more disk-space
efficient than full backup copies -- but if you are finding that
that's not the case, why bother keeping them?
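
(And if the stock snapcleaner script doesn't fit your setup, the same idea
is only a few lines -- a sketch that keeps just the N newest snapshot
directories; the paths are illustrative:)

    import os
    import shutil

    def clean_snapshots(data_dir, keep=2):
        """Delete all but the `keep` newest snapshot.* dirs (same idea as snapcleaner -N)."""
        # snapshot.YYYYMMDDHHMMSS names sort chronologically
        snaps = sorted(d for d in os.listdir(data_dir)
                       if d.startswith("snapshot."))
        for name in snaps[:-keep] if keep > 0 else snaps:
            shutil.rmtree(os.path.join(data_dir, name))

    # clean_snapshots("/var/solr/data", keep=2)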


-Hoss

Re: Create incremental snapshot

Posted by tushar kapoor <tu...@rediffmail.com>.
Thanks for the reply Asif. We have already tried removing the optimization
step. Unfortunately the commit command alone is also causing an identical
behaviour. Is there anything else that we are missing?


Asif Rahman wrote:
> 
> Tushar:
> 
> Is it necessary to do the optimize on each iteration?  When you run an
> optimize, the entire index is rewritten.  Thus each index file can have at
> most one hard link and each snapshot will consume the full amount of space
> on your disk.
> 
> Asif
> 
> On Thu, Jul 9, 2009 at 3:26 AM, tushar kapoor <
> tushar_kapoor_47@rediffmail.com> wrote:
> 
>>
>> What I gather from this discussion is -
>>
>> 1. Snapshots are always hard links, not actual files, so they cannot
>> possibly consume the same amount of space.
>> 2. Snapshots contain hard links to existing docs + delta docs.
>>
>> We are facing a situation wherein the snapshot occupies the same space as
>> the actual indexes thus violating the first point.
>> We have a batch processing scheme for refreshing indexes. The steps we
>> follow are -
>>
>> 1. Delete 200 documents in one go.
>> 2. Do an optimize.
>> 3. Create the 200 documents deleted earlier.
>> 4. Do a commit.
>>
>> This process continues for around 160,000 documents, i.e. 800 times, and by
>> the end of it we have 800 snapshots.
>>
>> The size of the actual index is 200 MB and, remarkably, all 800 snapshots
>> are around 200 MB each. In effect this process consumes around 160
>> GB of space on our disks. This is causing a lot of pain right now.
>>
>> My concerns are: Is our understanding of the snapshooter correct? Should
>> this massive space consumption be happening at all? Are we missing
>> something critical?
>>
>> Regards,
>> Tushar.
>>
>> Shalin Shekhar Mangar wrote:
>> >
>> > On Sat, Apr 18, 2009 at 1:06 PM, Koushik Mitra
>> > <Ko...@infosys.com>wrote:
>> >
>> >> Ok....
>> >>
>> >> If these are hard links, then where does the index data get stored?
>> Those
>> >> must be getting stored somewhere in the file system.
>> >>
>> >
>> > Yes, of course they are stored on disk. The hard links are created from
>> > the
>> > actual files inside the index directory. When those older files are
>> > deleted
>> > by Solr, they are still left on the disk if at least one hard link to
>> that
>> > file exists. If you are looking for how to clean old snapshots, you
>> could
>> > use the snapcleaner script.
>> >
>> > Is that what you wanted to do?
>> >
>> > --
>> > Regards,
>> > Shalin Shekhar Mangar.
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Create-incremental-snapshot-tp23109877p24405434.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Asif Rahman
> Lead Engineer - NewsCred
> asif@newscred.com
> http://platform.newscred.com
> 
> 
:-((
-- 
View this message in context: http://www.nabble.com/Create-incremental-snapshot-tp23109877p24447593.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Create incremental snapshot

Posted by Asif Rahman <as...@newscred.com>.
Tushar:

Is it necessary to do the optimize on each iteration?  When you run an
optimize, the entire index is rewritten.  Thus each index file can have at
most one hard link and each snapshot will consume the full amount of space
on your disk.

Asif

On Thu, Jul 9, 2009 at 3:26 AM, tushar kapoor <
tushar_kapoor_47@rediffmail.com> wrote:

>
> What I gather from this discussion is -
>
> 1. Snapshots are always hard links, not actual files, so they cannot
> possibly consume the same amount of space.
> 2. Snapshots contain hard links to existing docs + delta docs.
>
> We are facing a situation wherein the snapshot occupies the same space as
> the actual indexes thus violating the first point.
> We have a batch processing scheme for refreshing indexes. The steps we
> follow are -
>
> 1. Delete 200 documents in one go.
> 2. Do an optimize.
> 3. Create the 200 documents deleted earlier.
> 4. Do a commit.
>
> This process continues for around 160,000 documents, i.e. 800 times, and by
> the end of it we have 800 snapshots.
>
> The size of the actual index is 200 MB and, remarkably, all 800 snapshots
> are around 200 MB each. In effect this process consumes around 160
> GB of space on our disks. This is causing a lot of pain right now.
>
> My concerns are: Is our understanding of the snapshooter correct? Should
> this massive space consumption be happening at all? Are we missing
> something critical?
>
> Regards,
> Tushar.
>
> Shalin Shekhar Mangar wrote:
> >
> > On Sat, Apr 18, 2009 at 1:06 PM, Koushik Mitra
> > <Ko...@infosys.com>wrote:
> >
> >> Ok....
> >>
> >> If these are hard links, then where does the index data get stored?
> Those
> >> must be getting stored somewhere in the file system.
> >>
> >
> > Yes, of course they are stored on disk. The hard links are created from
> > the
> > actual files inside the index directory. When those older files are
> > deleted
> > by Solr, they are still left on the disk if at least one hard link to
> that
> > file exists. If you are looking for how to clean old snapshots, you could
> > use the snapcleaner script.
> >
> > Is that what you wanted to do?
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Create-incremental-snapshot-tp23109877p24405434.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Asif Rahman
Lead Engineer - NewsCred
asif@newscred.com
http://platform.newscred.com