Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2016/10/13 15:20:30 UTC

Slow indexing speed when index size is large?

Hi,

Would like to find out: will the indexing speed in a collection with a very
large index be much slower than in one that is still empty or has only a
very small index? This is assuming that the configurations, indexing code
and the files to be indexed are the same.

Currently, I have one setup in which the collection is still empty, and I
managed to achieve an indexing speed of more than 7GB/hr. I also have
another setup in which the collection has an index size of 1.6TB, and when
I tried to index new documents into it, the indexing speed was less than
0.7GB/hr.

This setup was done with Solr 5.4.0.

Regards,
Edwin

Re: Slow indexing speed when index size is large?

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Shawn,

Thanks for the information.

Regards,
Edwin


On 14 October 2016 at 20:19, Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/13/2016 9:58 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for the reply Shawn. Currently, my heap allocation to each Solr
> > instance is 22GB. Is that big enough?
>
> I can't answer that question.  I know little about your install.  Even
> if I *did* know a few more things about your install, I could only make
> a *guess* about how much heap you need, and I'd probably be wrong.
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> I did write down what I consider to be a good way to figure out a
> correct heap size, but it requires experimentation with your live
> system, which might cause disruption of your search service:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
>
> Thanks,
> Shawn
>
>

Re: Slow indexing speed when index size is large?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/13/2016 9:58 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the reply Shawn. Currently, my heap allocation to each Solr
> instance is 22GB. Is that big enough? 

I can't answer that question.  I know little about your install.  Even
if I *did* know a few more things about your install, I could only make
a *guess* about how much heap you need, and I'd probably be wrong.

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

I did write down what I consider to be a good way to figure out a
correct heap size, but it requires experimentation with your live
system, which might cause disruption of your search service:

https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
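[Editor's note: not part of the original thread. The experimentation Shawn
describes usually starts with GC logging. Solr 5.x ships a bin/solr.in.sh
that exposes the heap size and GC logging options; a minimal sketch follows,
where the heap value matches the 22GB mentioned in this thread and the log
path is illustrative, not a recommendation.]

```shell
# bin/solr.in.sh (Solr 5.x) -- illustrative values, adjust for your system
SOLR_HEAP="22g"                           # heap per Solr instance, as in this thread

# Standard HotSpot GC logging flags; the log file path is hypothetical.
GC_LOG_OPTS="-verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -Xloggc:/var/solr/logs/solr_gc.log"
```

Reviewing the resulting GC log during indexing shows pause frequency and
duration, which is the raw data the linked wiki method asks you to gather.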

Thanks,
Shawn


Re: Slow indexing speed when index size is large?

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Thanks for the reply Shawn.

Currently, my heap allocation to each Solr instance is 22GB.
Is that big enough?

Regards,
Edwin


On 13 October 2016 at 23:56, Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/13/2016 9:20 AM, Zheng Lin Edwin Yeo wrote:
> > Would like to find out: will the indexing speed in a collection with a
> > very large index be much slower than in one that is still empty or has
> > only a very small index? This is assuming that the configurations,
> > indexing code and the files to be indexed are the same. Currently, I
> > have one setup in which the collection is still empty, and I managed to
> > achieve an indexing speed of more than 7GB/hr. I also have another
> > setup in which the collection has an index size of 1.6TB, and when I
> > tried to index new documents into it, the indexing speed was less than
> > 0.7GB/hr.
>
> I have noticed this phenomenon myself.  As the amount of index data
> already present increases, indexing slows down.  Best guess as to the
> cause: more frequent and longer-lasting garbage collections.
>
> Indexing involves a LOT of memory allocation.  Most of the memory chunks
> that get allocated are quickly discarded because they do not need to be
> retained.
>
> If you understand how the Java memory model works, then you know that
> this means there will be a lot of garbage collection.  Each GC will tend
> to take longer if there are a large number of objects allocated that are
> NOT garbage.
>
> When the index is large, Lucene/Solr must allocate and retain a larger
> amount of memory just to ensure that everything works properly.  This
> leaves less free memory, so indexing will cause more frequent garbage
> collections ... and because the amount of retained memory is
> correspondingly larger, each garbage collection will take longer than it
> would with a smaller index.  A ten to one difference in speed does seem
> extreme, though.
>
> You might want to increase the heap allocated to each Solr instance, so
> GC is less frequent.  This can take memory away from the OS disk cache,
> though.  If the amount of OS disk cache drops too low, general
> performance may suffer.
>
> Thanks,
> Shawn
>
>

Re: Slow indexing speed when index size is large?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/13/2016 9:20 AM, Zheng Lin Edwin Yeo wrote:
> Would like to find out: will the indexing speed in a collection with a
> very large index be much slower than in one that is still empty or has
> only a very small index? This is assuming that the configurations,
> indexing code and the files to be indexed are the same. Currently, I
> have one setup in which the collection is still empty, and I managed to
> achieve an indexing speed of more than 7GB/hr. I also have another
> setup in which the collection has an index size of 1.6TB, and when I
> tried to index new documents into it, the indexing speed was less than
> 0.7GB/hr.

I have noticed this phenomenon myself.  As the amount of index data
already present increases, indexing slows down.  Best guess as to the
cause: more frequent and longer-lasting garbage collections.

Indexing involves a LOT of memory allocation.  Most of the memory chunks
that get allocated are quickly discarded because they do not need to be
retained.

If you understand how the Java memory model works, then you know that
this means there will be a lot of garbage collection.  Each GC will tend
to take longer if there are a large number of objects allocated that are
NOT garbage.

When the index is large, Lucene/Solr must allocate and retain a larger
amount of memory just to ensure that everything works properly.  This
leaves less free memory, so indexing will cause more frequent garbage
collections ... and because the amount of retained memory is
correspondingly larger, each garbage collection will take longer than it
would with a smaller index.  A ten to one difference in speed does seem
extreme, though.

You might want to increase the heap allocated to each Solr instance, so
GC is less frequent.  This can take memory away from the OS disk cache,
though.  If the amount of OS disk cache drops too low, general
performance may suffer.
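[Editor's note: not part of the original thread. The heap vs. OS disk cache
tradeoff Shawn describes can be inspected directly on a Linux host; a hedged
sketch, assuming the standard procps `free` tool is available.]

```shell
# On Linux, the kernel reports how much RAM is currently used as page
# cache; that is the memory available for caching the Lucene index files.
free -m        # the "cached" figure approximates the OS disk cache

# Rough rule of thumb: total RAM minus the sum of all JVM heaps on the
# box is what remains for the OS disk cache. Growing SOLR_HEAP reduces
# GC pressure but shrinks that remainder, so with a 1.6TB index the two
# needs have to be balanced rather than maximizing either one.
```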

Thanks,
Shawn