Posted to dev@lucene.apache.org by "Noble Paul (JIRA)" <ji...@apache.org> on 2017/03/12 00:58:04 UTC

[jira] [Comment Edited] (SOLR-10265) Overseer can become the bottleneck in very large clusters

    [ https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906381#comment-15906381 ] 

Noble Paul edited comment on SOLR-10265 at 3/12/17 12:57 AM:
-------------------------------------------------------------

bq. I haven't debugged why the transaction logs ran into terabytes without taking snapshots into account, but this was my assumption based on the other problems we observed

The current design is suboptimal. To change one small piece of information we write ~300 bytes per replica; for 600 replicas that is ~180 KB of data (and all 600 nodes have to read and parse that data on every state-change operation). Every write is appended to the ZK transaction log. We must split state.json to scale any further.
If we instead write the state out separately in a format like the following
{code}
{
  "replica1": 1,
  "replica2": 0
}
{code}

we can bring the data size down to around 10 bytes per replica. This means 600 replicas produce only ~6 KB of data per update.
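For illustration, a minimal sketch of producing such a compact per-replica document and measuring its size (the 0/1 state codes and the hand-rolled serialization are assumptions, not a final design):

{code}
// Sketch only: serialize just replica-name -> state-code pairs instead
// of the full state.json. The numeric state codes are assumptions.
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class CompactReplicaState {
  public static void main(String[] args) {
    Map<String, Integer> states = new LinkedHashMap<>();
    states.put("replica1", 1); // assumed code: 1 = active
    states.put("replica2", 0); // assumed code: 0 = down

    StringBuilder json = new StringBuilder("{");
    for (Map.Entry<String, Integer> e : states.entrySet()) {
      if (json.length() > 1) json.append(',');
      json.append('"').append(e.getKey()).append("\":").append(e.getValue());
    }
    json.append('}');

    byte[] bytes = json.toString().getBytes(StandardCharsets.UTF_8);
    // Prints: {"replica1":1,"replica2":0} = 27 bytes (~13 bytes/replica)
    System.out.println(json + " = " + bytes.length + " bytes");
  }
}
{code}

Even with longer replica names, the per-replica cost stays an order of magnitude below the ~300 bytes written today.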


> Overseer can become the bottleneck in very large clusters
> ---------------------------------------------------------
>
>                 Key: SOLR-10265
>                 URL: https://issues.apache.org/jira/browse/SOLR-10265
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>
> Let's say we have a large cluster. Some numbers:
> - To ingest data at the volume we want, we need roughly a 600-shard collection.
> - We index into the collection for 1 hour and then create a new collection.
> - For a 30-day retention window, with these numbers we would end up with ~400k cores in the cluster.
> - Just a rolling restart of this cluster can take hours because the Overseer queue gets backed up. If a few nodes lose connectivity to ZooKeeper, we can also end up with lots of messages in the Overseer queue.
> From some tests, here are the two high-level problems we have identified:
> 1> How fast can the Overseer process operations:
> The rate at which the Overseer processes events is too slow at this scale.
> I ran {{OverseerTest#testPerformance}}, which creates 10 collections (1 shard, 1 replica) and generates 20k state-change events. The test took 119 seconds on my machine, which works out to ~170 events a second. Let's say a server can process 5x what my machine can, so ~1k events a second.
> Total events generated by a 400k-replica cluster = 400k * 4 (state changes until a replica becomes active) = 1.6M events; at 1k events a second that is ~1,600 seconds of back-to-back processing, worked out below.
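> A quick back-of-the-envelope check of these numbers (using the measured ~170 events a second and the assumed 1k events a second from above):
> {code}
> 400,000 replicas * 4 state changes each  = 1,600,000 events
> 1,600,000 events / 1,000 events a second = 1,600 seconds  (~27 minutes)
> 1,600,000 events / 170 events a second   = ~9,400 seconds (~2.6 hours)
> {code}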
> A second observation was that the rate at which the Overseer processes events slows down as the number of items in the queue grows.
> I ran the same {{OverseerTest#testPerformance}} but with 2,000 generated events instead. The test took only 5 seconds (~400 events a second), so it was much faster than the run that generated 20k events.
> 2> State changes overwhelming ZK:
> For every state change, Solr writes out a big state.json to ZooKeeper. This can cause the ZooKeeper transaction logs to grow out of control even with auto-purging configured; a sketch of the write pattern follows this description.
> I haven't debugged why the transaction logs ran into terabytes without taking snapshots into account, but this was my assumption based on the other problems we observed.
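> For illustration, a minimal sketch of that write pattern using the plain ZooKeeper client (the connect string, path, and sizes are assumptions based on the numbers in this issue):
> {code}
> // Sketch only: every small replica state change rewrites the whole
> // collection state.json, and each write is appended to the ZooKeeper
> // transaction log. Size assumes ~300 bytes x 600 replicas.
> import org.apache.zookeeper.ZooKeeper;
>
> public class StateJsonWritePattern {
>   public static void main(String[] args) throws Exception {
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
>     byte[] fullStateJson = new byte[180 * 1024]; // ~180 KB each time
>     // One replica flips state -> the entire ~180 KB document is
>     // rewritten and logged, not just the ~10 bytes that changed:
>     zk.setData("/collections/coll1/state.json", fullStateJson, -1);
>     zk.close();
>   }
> }
> {code}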



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org