Posted to users@solr.apache.org by Kudrettin Güleryüz <ku...@gmail.com> on 2021/05/18 16:37:07 UTC

Index size

Hello,

Experimenting with optimizing the index size.

Can you help me understand why indexing, but not storing, a file 10,000
times increases the index size by 2,500 times? Solr 7.3 here. Schema and
all other conditions are kept constant.

Thanks

Re: Index size

Posted by Walter Underwood <wu...@wunderwood.org>.

The keys are the same, but the index is bigger. Solr indexes the position 
of each term in each document. One term in one document is one position.
One term in 10k documents is 10k positions. One term occurring twice in
each of 10k documents is 20k positions.
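
The position counting above can be sketched with a toy positional index.
This is only an illustration in Python, not Lucene's on-disk format; the
function names and the sample sentence are made up for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (toy model, not Lucene's format)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def index_stats(index):
    """Return (number of unique terms, total number of positions recorded)."""
    terms = len(index)
    positions = sum(len(p) for postings in index.values() for p in postings.values())
    return terms, positions


# Hypothetical content: 10 tokens, 9 unique terms ("each" occurs twice).
content = "solr indexes the position of each term in each document"

one = build_positional_index({"doc-0": content})
many = build_positional_index({f"doc-{i}": content for i in range(10_000)})

print(index_stats(one))   # (9, 10)
print(index_stats(many))  # (9, 100000)
```

The set of keys stays at 9 terms in both cases, while the number of
recorded positions grows linearly with the number of documents.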

Also, indexing many copies of the same document is not a good way to
forecast index size. The size depends on the statistics of the actual documents
and on the schema.

Measure it with real data and the schema you expect to use.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 19, 2021, at 1:18 PM, Kudrettin Güleryüz <ku...@gmail.com> wrote:
> 
> Thanks for the insight. I forgot to mention a key piece of information
> when describing experiment two:
> 
> Although their content is exactly the same, each document is different
> because of its filename: each of the 10,000 files has a different name.
> Therefore the content of some fields, such as filename and id, always
> differs. The most significant field in terms of storage size is the
> content field, and that is exactly the same for all files in this
> experiment.
> 
> Since that is the case, I don't think any Solr documents are being
> deleted. In fact, when I run update?optimize=true, there is no
> significant change in the total size of the index.

Re: Index size

Posted by Kudrettin Güleryüz <ku...@gmail.com>.

Thanks for the insight. I forgot to mention a key piece of information
when describing experiment two:

Although their content is exactly the same, each document is different
because of its filename: each of the 10,000 files has a different name.
Therefore the content of some fields, such as filename and id, always
differs. The most significant field in terms of storage size is the
content field, and that is exactly the same for all files in this
experiment.

Since that is the case, I don't think any Solr documents are being
deleted. In fact, when I run update?optimize=true, there is no
significant change in the total size of the index.

On Wed, May 19, 2021 at 11:23 AM Mark H. Wood <mw...@iupui.edu> wrote:

> On Wed, May 19, 2021 at 09:03:52AM -0400, Kudrettin Güleryüz wrote:
> > Sorry, I meant I am trying to reduce the index size... I am not using the
> > index optimize feature at this point.
> >
> > Experiment one:
> > Index a ~10KB document once. Total index size across multiple
> > shards: ~117KB
> >
> > Experiment two:
> > Index the same ~10KB document 10,000 times. Total index size across
> > multiple shards: ~250MB
> >
> > I am assuming that the number of terms (keys) in the inverted index
> > wouldn't increase by indexing the same document multiple times.
> > Therefore I would expect the increase in index size to be minimal
> > compared to indexing a totally different document. Can you tell me
> > what I am missing?
>
>
> https://lucidworks.com/post/solr-segment-merge-frees-wasted-space-caused-by-deleted-documents/
>
> In short:  Solr doesn't re-use the space occupied by deleted index
> entries.  Replacing a document causes the entries for the previous
> version to be deleted.  Eventually Solr will reorganize parts of the
> index into new files, and this drops *some* deleted index entries.  At
> any point in time, Solr will be holding some "wasted" space, but it's
> under control and normally you don't need to worry about it.
>
> --
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
>

Re: Index size

Posted by "Mark H. Wood" <mw...@iupui.edu>.

On Wed, May 19, 2021 at 09:03:52AM -0400, Kudrettin Güleryüz wrote:
> Sorry, I meant I am trying to reduce the index size... I am not using the
> index optimize feature at this point.
> 
> Experiment one:
> Index a ~10KB document once. Total index size across multiple
> shards: ~117KB
> 
> Experiment two:
> Index the same ~10KB document 10,000 times. Total index size across
> multiple shards: ~250MB
> 
> I am assuming that the number of terms (keys) in the inverted index
> wouldn't increase by indexing the same document multiple times.
> Therefore I would expect the increase in index size to be minimal
> compared to indexing a totally different document. Can you tell me
> what I am missing?

https://lucidworks.com/post/solr-segment-merge-frees-wasted-space-caused-by-deleted-documents/

In short:  Solr doesn't re-use the space occupied by deleted index
entries.  Replacing a document causes the entries for the previous
version to be deleted.  Eventually Solr will reorganize parts of the
index into new files, and this drops *some* deleted index entries.  At
any point in time, Solr will be holding some "wasted" space, but it's
under control and normally you don't need to worry about it.
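
One rough way to check whether deleted entries explain the size is to
compare the live document count with the total number of document slots
in the index. The sketch below is Python and works on two plain numbers;
it assumes you have obtained numDocs and maxDoc yourself (for example
from Solr's admin UI or the Luke request handler, as far as I recall).
The function name and the sample values are hypothetical.

```python
def deleted_fraction(num_docs: int, max_doc: int) -> float:
    """Fraction of document slots held by deleted (not yet merged away) docs.

    num_docs: live documents; max_doc: live plus deleted documents.
    """
    if max_doc == 0:
        return 0.0
    return (max_doc - num_docs) / max_doc


# Hypothetical case A: 10,000 documents with distinct ids, nothing replaced.
print(deleted_fraction(num_docs=10_000, max_doc=10_000))  # 0.0

# Hypothetical case B: the same id re-indexed 10,000 times, before any merge.
print(deleted_fraction(num_docs=1, max_doc=10_000))  # 0.9999
```

A value near zero suggests deleted entries are not what is taking the
space; a high value suggests merging would reclaim a good part of it.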

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Re: Index size

Posted by Kudrettin Güleryüz <ku...@gmail.com>.

Sorry, I meant I am trying to reduce the index size... I am not using the
index optimize feature at this point.

Experiment one:
Index a ~10KB document once. Total index size across multiple
shards: ~117KB

Experiment two:
Index the same ~10KB document 10,000 times. Total index size across
multiple shards: ~250MB

I am assuming that the number of terms (keys) in the inverted index
wouldn't increase by indexing the same document multiple times.
Therefore I would expect the increase in index size to be minimal
compared to indexing a totally different document. Can you tell me
what I am missing?
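
For reference, the arithmetic behind the two experiments can be worked
out as below. This is a small Python sketch using the approximate sizes
quoted above and assuming decimal units (1 KB = 1,000 bytes); the
variable names are only for illustration.

```python
KB = 1_000
MB = 1_000_000

single_doc_index = 117 * KB   # experiment one: one ~10KB document
many_docs_index = 250 * MB    # experiment two: 10,000 documents
doc_count = 10_000

growth = many_docs_index / single_doc_index
per_doc = many_docs_index / doc_count

print(f"growth factor: {growth:.0f}x")                     # growth factor: 2137x
print(f"index bytes per document: {per_doc / KB:.1f} KB")  # index bytes per document: 25.0 KB
```

With these rounded figures the growth is roughly 2,100 times for 10,000
times as many documents, which works out to about 25 KB of index per
document.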

On Tue, May 18, 2021 at 12:48 PM Dave <ha...@gmail.com> wrote:

> At a certain point the index size doesn’t matter. When you reindex a
> document, you do not delete the existing document; you mark it as
> deleted and add the replacement.  An optimize is what removes the
> documents marked as deleted, but optimizing is no longer really
> recommended, since Solr is very good at merging and disk is
> inexpensive.  The reason the index grew, I’m guessing, is that even
> though it’s only indexed, that data is still stored and of course
> duplicated.  If its performance has not been adversely affected, I
> would never run the optimize command. I’ve pushed an index that is
> naturally 450GB all the way to 800GB+ and it ran great, assuming you
> have the disk space available.

Re: Index size

Posted by Dave <ha...@gmail.com>.

At a certain point the index size doesn’t matter. When you reindex a document, you do not delete the existing document; you mark it as deleted and add the replacement.  An optimize is what removes the documents marked as deleted, but optimizing is no longer really recommended, since Solr is very good at merging and disk is inexpensive.  The reason the index grew, I’m guessing, is that even though it’s only indexed, that data is still stored and of course duplicated.  If its performance has not been adversely affected, I would never run the optimize command. I’ve pushed an index that is naturally 450GB all the way to 800GB+ and it ran great, assuming you have the disk space available.

> On May 18, 2021, at 12:37 PM, Kudrettin Güleryüz <ku...@gmail.com> wrote:
> 
> Hello,
> 
> Experimenting with optimizing the index size.
> 
> Can you help me understand why indexing, but not storing, a file 10,000
> times increases the index size by 2,500 times? Solr 7.3 here. Schema and
> all other conditions are kept constant.
> 
> Thanks