Posted to solr-user@lucene.apache.org by Varun Rajput <va...@gmail.com> on 2014/03/13 08:44:42 UTC

Solr Cloud Segments and Merging Issues

I am using Solr 4.6.0 in cloud mode. The setup is 4 shards, one on each
machine, with a ZooKeeper quorum running on 3 other machines. The index size
on each shard is about 15GB. I noticed that the number of segments in the
second shard was 42 and in the remaining shards was between 25 and 30.

I am basically trying to get the number of segments down to something
reasonable, like 4 or 5, in order to improve search time. We do have some
documents indexed every day, so we don't want to run an optimize every day.

The merge factor with the TieredMergePolicy is only the number of segments
per tier. Assuming there were 5 tiers (mergeFactor of 10) in the second
shard, I tried clearing the index, reducing the mergeFactor and re-indexing
the same data in the same manner, multiple times, but I don't see a pattern
of reduction in the number of segments.

No mergeFactor set  =>  42 segments
mergeFactor=5       =>  22 segments
mergeFactor=2       =>  22 segments

Below is the simple configuration I am using for merging, as specified in
the documentation:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">2</int>
  <int name="segmentsPerTier">2</int>
</mergePolicy>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>

What is the best way in which I can use merging to restrict the number of
segments being formed?

Also, we are moving from Solr 1.4 (Master-Slave) to Solr 4.6.0 Cloud and
see a great increase in response time from about 18ms to 150ms. Is this a
known issue? Is there no way to reduce the response time? In the MBeans,
the individual cores show the /select handler attributes having search
times around 8ms. What is it that causes the overall response time to
increase so much?

-Varun

Re: Solr Cloud Segments and Merging Issues

Posted by Varun Rajput <va...@gmail.com>.
Hey Shawn,

> The config with the old policy used to be the literal name
> "mergeFactor".  With TieredMergePolicy, there are now three settings
> that must be changed in order to actually be the same as what
> mergeFactor used to do. The following config snippet is the equivalent
> config to a mergeFactor of 10, so these are the default settings.  If
> you don't change all three (especially segmentsPerTier), then you are
> not actually changing the "mergeFactor".
> 
>    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>      <int name="maxMergeAtOnce">10</int>
>      <int name="segmentsPerTier">10</int>
>      <int name="maxMergeAtOnceExplicit">30</int>
>    </mergePolicy>

I tried specifying all of these settings, but it still doesn't work as
expected. I even tried setting maxMergedSegmentMB to 20GB instead of the
default 5GB. This is the config I tried:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">2</int>
  <int name="segmentsPerTier">2</int>
  <int name="maxMergeAtOnceExplicit">100</int>
  <long name="maxMergedSegmentMB">21990232555520</long>
</mergePolicy>
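
A quick unit check on the maxMergedSegmentMB value in the config above may
help here. This is only a sketch, assuming the setting is read as megabytes
(as its name suggests); the helper name gb_to_mb is made up for illustration.

```python
# Unit check for maxMergedSegmentMB, assuming the value is in megabytes.
MB_PER_GB = 1024


def gb_to_mb(gigabytes):
    """Convert gigabytes to megabytes (hypothetical helper)."""
    return gigabytes * MB_PER_GB


configured = 21990232555520      # value used in the config above
twenty_gb_in_mb = gb_to_mb(20)   # 20GB expressed in MB
default_in_mb = gb_to_mb(5)      # the 5GB default expressed in MB

# The configured number is 20 * 1024**4, which is 20GB * 1024 written in bytes.
print(configured == 20 * 1024 ** 4)
print(twenty_gb_in_mb, default_in_mb)
```

Under that assumption, 20GB would be written as 20480, and the much larger
number in the config would act as an effectively unlimited cap rather than
a 20GB one.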


> With newer Solr versions, there is not as much speedup to be gained from
> fewer segments as before.  There *is* a noticeable change, but it is no
> longer the night/day difference it used to be.

We ran a performance test on a normal and an optimized index and saw a
considerable improvement (almost double) in response time. That's why we
want to reduce the number of segments: we have a large index with a very
small number of updates.

> Assuming that there are no system resource limitations (especially RAM),
> a distributed index is slower than a single index of the same total
> size.  Where distributed indexes have an edge is in very large indexes
> or indexes with a moderately high query rate -- by applying more total
> RAM and/or CPU resources to the problem.  If your index already fits
> entirely into the OS disk cache, or you are sending a handful of test
> queries, you won't notice any performance benefit from going distributed.

We have a large index which won't fit in memory and need high query rates.

> For SUPER high query rates, you need more replicas.  More shards might
> actually make performance go down in this situation.

This is something we identified while testing. We had to reduce the number
of shards, while keeping enough of them to let the data grow in the
future.

-Varun





--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Segments-and-Merging-Issues-tp4123316p4123489.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Cloud Segments and Merging Issues

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/13/2014 1:44 AM, Varun Rajput wrote:
> I am using Solr 4.6.0 in cloud mode. The setup is 4 shards, one on each
> machine, with a ZooKeeper quorum running on 3 other machines. The index size
> on each shard is about 15GB. I noticed that the number of segments in the
> second shard was 42 and in the remaining shards was between 25 and 30.
>
> I am basically trying to get the number of segments down to something
> reasonable, like 4 or 5, in order to improve search time. We do have some
> documents indexed every day, so we don't want to run an optimize every day.
>
> The merge factor with the TieredMergePolicy is only the number of segments
> per tier. Assuming there were 5 tiers (mergeFactor of 10) in the second
> shard, I tried clearing the index, reducing the mergeFactor and re-indexing
> the same data in the same manner, multiple times, but I don't see a pattern
> of reduction in the number of segments.
>
> No mergeFactor set      =>     42 segments
> mergeFactor=5      =>       22 segments
> mergeFactor=2      =>       22 segments
>
> Below is the simple configuration, as specified in the documentation, I am
> using for merging:
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>
>            <int name="maxMergeAtOnce">2</int>
>
>            <int name="segmentsPerTier">2</int>
>
> </mergePolicy>
>
> <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>
> What is the best way in which I can use merging to restrict the number of
> segments being formed?

The config with the old policy used to be the literal name 
"mergeFactor".  With TieredMergePolicy, there are now three settings 
that must be changed in order to actually be the same as what 
mergeFactor used to do. The following config snippet is the equivalent 
config to a mergeFactor of 10, so these are the default settings.  If 
you don't change all three (especially segmentsPerTier), then you are 
not actually changing the "mergeFactor".

   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
     <int name="maxMergeAtOnce">10</int>
     <int name="segmentsPerTier">10</int>
     <int name="maxMergeAtOnceExplicit">30</int>
   </mergePolicy>
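
To get a feel for how these settings bound the segment count, here is a
rough sketch of a tiered segment budget. It is a simplified model and not
the actual Lucene code: it assumes the smallest tier starts at a 2MB floor,
that each tier holds up to segmentsPerTier segments, and that tier size
grows by maxMergeAtOnce. It ignores maxMergedSegmentMB, deletes and merges
in progress. The function name allowed_segments is made up for illustration.

```python
import math


def allowed_segments(index_mb, segments_per_tier, max_merge_at_once, floor_mb=2.0):
    """Estimate how many segments a tiered policy tolerates before merging.

    Simplified model: fill tiers from the smallest segment size upward,
    allowing segments_per_tier segments in each, until the index size
    is used up.
    """
    level_mb = floor_mb          # segment size of the current tier (assumed floor)
    left_mb = float(index_mb)    # index size not yet assigned to a tier
    allowed = 0
    while True:
        count_at_level = left_mb / level_mb
        if count_at_level < segments_per_tier:
            # Last, partially filled tier
            allowed += math.ceil(count_at_level)
            break
        allowed += segments_per_tier
        left_mb -= segments_per_tier * level_mb
        level_mb *= max_merge_at_once
    return allowed


index_mb = 15 * 1024  # roughly the 15GB shard described above
for factor in (10, 5, 2):
    print(factor, allowed_segments(index_mb, factor, factor))
```

Under these assumptions a 15GB shard comes out at a budget of 37 segments
for a factor of 10, 27 for 5 and 24 for 2. That suggests lowering the
factor alone flattens out in the low twenties for an index this size,
though the real policy may behave differently.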

With newer Solr versions, there is not as much speedup to be gained from 
fewer segments as before.  There *is* a noticeable change, but it is no 
longer the night/day difference it used to be.

> Also, we are moving from Solr 1.4 (Master-Slave) to Solr 4.6.0 Cloud and
> see a great increase in response time from about 18ms to 150ms. Is this a
> known issue? Is there no way to reduce the response time? In the MBeans,
> the individual cores show the /select handler attributes having search
> times around 8ms. What is it that causes the overall response time to
> increase so much?

Assuming that there are no system resource limitations (especially RAM), 
a distributed index is slower than a single index of the same total 
size.  Where distributed indexes have an edge is in very large indexes 
or indexes with a moderately high query rate -- by applying more total 
RAM and/or CPU resources to the problem.  If your index already fits 
entirely into the OS disk cache, or you are sending a handful of test 
queries, you won't notice any performance benefit from going distributed.

For SUPER high query rates, you need more replicas.  More shards might 
actually make performance go down in this situation.

You can run a single shard with SolrCloud -- there's nothing saying the 
index HAS to be distributed.
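
One way to see how much of the response time comes from distribution is to
send the same query to a single core with distrib=false and compare its
QTime with the normal distributed request. The sketch below only builds the
two URLs and does not contact a server. The host, port and core name are
hypothetical placeholders, and the helper name select_url is made up for
illustration.

```python
from urllib.parse import urlencode


def select_url(base_url, query, distributed=True):
    """Build a /select URL, optionally restricted to the local core only."""
    params = {"q": query, "wt": "json", "rows": 10}
    if not distributed:
        # distrib=false asks the core to answer from its own index only
        params["distrib"] = "false"
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical core URL, replace with a real shard core
core = "http://shard1.example.com:8983/solr/collection1"
print(select_url(core, "*:*", distributed=True))
print(select_url(core, "*:*", distributed=False))
```

If the single-core request stays near the per-core times seen in the MBeans
while the distributed one is much slower, the difference is being spent on
fanning out the request and merging the results.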

Thanks,
Shawn


Re: Solr Cloud Segments and Merging Issues

Posted by Varun Rajput <va...@gmail.com>.
Hi Remi,

I read your post and, like you, I have found that running Solr 4.6.0 in
cloud mode results in a higher response time, which seems to have something
to do with merging the documents from the various shards.

Looking at the source code, we couldn't understand why merging the documents
would take so much time. If you do find a solution, please share it with
me.

Thanks,
Varun



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Segments-and-Merging-Issues-tp4123316p4123472.html

Re: Solr Cloud Segments and Merging Issues

Posted by remi tassing <ta...@gmail.com>.
Hi Varun,

I would just like to say that I have the same two problems you've mentioned,
and I couldn't figure out a way to solve them.

For the second one, I posted a question a couple of days ago titled "Result
merging takes too long".

Remi

