Posted to solr-user@lucene.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2012/02/01 18:12:13 UTC

SolrReplication configuration with frequent deletes and updates

I have the following requirements:

1. Adds: 20 docs/sec
2. Searches: 100 searches/sec
3. Deletes: 20*3600*24*7 ~ 12 million docs/week (a cron job that
deletes all documents more than 7 days old)
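
For reference, the weekly volume and the cron job's delete can be sketched
in a few lines of Python. This is only an illustration: the field name
"timestamp" is an assumption (use whatever date field the schema defines),
and the XML body is what would be POSTed to Solr's /update handler,
followed by a commit.

```python
# Weekly document volume implied by the add rate, plus the
# delete-by-query body a cron job could send to /update.
# ASSUMPTION: the date field is named "timestamp"; it is hypothetical.

ADDS_PER_SEC = 20
SECONDS_PER_WEEK = 3600 * 24 * 7

docs_per_week = ADDS_PER_SEC * SECONDS_PER_WEEK


def delete_older_than_xml(days, field="timestamp"):
    """Build a Solr XML delete-by-query for documents older than `days`."""
    return "<delete><query>{0}:[* TO NOW-{1}DAYS]</query></delete>".format(
        field, days
    )


if __name__ == "__main__":
    print(docs_per_week)             # a little over 12 million
    print(delete_older_than_xml(7))
```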

I am thinking of having 6 shards (2 million docs each), each with
1 master and 2 slaves using SolrReplication. I have the following
questions:

1. With 50 searches/sec per shard and 2 million docs, what response
time should I expect? I would like to keep it under 100 ms.
2. What would be a reasonable latency (pollInterval) on the slaves for
SolrReplication (all slaves are connected to a single backplane)? Is a
1-minute pollInterval reasonable?
3. Is NRT a better/viable option compared to SolrReplication?

-Thanks,
Prasenjit

Re: SolrReplication configuration with frequent deletes and updates

Posted by Erick Erickson <er...@gmail.com>.
First of all, what evidence do you have that you even need to shard?
12 M documents is quite a small index by Solr standards, just test it
and see.

As far as replication goes, 10 minutes is probably a good place to start,
but you can experiment with reducing it. I've often found that "real time"
is less important than people initially think. What you need to do is
measure the warmup time for your slaves and make sure your polling
interval is longer than that, so that one new searcher finishes warming
before the next one is opened. You basically have the same problem
if you put everything on the master: the warmup interval for your
new searchers is what really governs your latency the most.
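
A minimal sketch of that measurement-driven choice, in Python. It assumes
the goal is a polling interval long enough for a newly opened searcher to
finish warming before the next poll can trigger another one. The warmup
samples, the 2x safety factor, and the 60-second floor are all
illustrative assumptions; the output uses the HH:mm:ss form that the
replication handler's pollInterval setting takes.

```python
# Suggest a slave pollInterval from measured searcher warmup times.
# ASSUMPTIONS: warmup_seconds are your own measurements (e.g. taken from
# slave logs); safety_factor and floor_seconds are arbitrary starting points.

def format_poll_interval(seconds):
    """Format a duration in seconds as HH:mm:ss."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return "{0:02d}:{1:02d}:{2:02d}".format(hours, minutes, secs)


def suggest_poll_interval(warmup_seconds, safety_factor=2.0, floor_seconds=60):
    """Return an interval that leaves headroom over the slowest warmup."""
    worst = max(warmup_seconds)
    return format_poll_interval(max(floor_seconds, worst * safety_factor))


if __name__ == "__main__":
    print(suggest_poll_interval([35, 48, 90]))
```

If the suggested interval is longer than the freshness you need, the lever
to pull is the warmup itself (fewer or cheaper warming queries, smaller
autowarm counts) rather than the interval.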

With an index as small as yours, you might be OK with putting it all on
a single machine, no master/slave setup.

Unfortunately, there's no good way to say without testing *your*
specific queries on *your* specific setup.

Best
Erick

Re: SolrReplication configuration with frequent deletes and updates

Posted by prasenjit mukherjee <pr...@gmail.com>.
I appreciate your reply. I have some more follow-up questions inline.

On Thu, Feb 2, 2012 at 12:35 AM, Emmanuel Espina
<es...@gmail.com> wrote:
>> 1. Adds: 20 docs/sec
>> 2. Searches: 100 searches/sec
>> 3. Deletes: 20*3600*24*7 ~ 12 million docs/week (a cron job that
>> deletes all documents more than 7 days old)
>>
>> I am thinking of having 6 shards (2 million docs each), each with
>> 1 master and 2 slaves using SolrReplication. I have the following
>> questions:
>>
>> 1. With 50 searches/sec per shard and 2 million docs, what response
>> time should I expect? I would like to keep it under 100 ms.
>
> That is quite a lot of searches per second, considering that every
> query has to search all 6 shards (coordination and network latency
> affect the results). The components you use and the complexity
> of the query (as well as the number of segments in each shard) also
> affect the time. 100 ms is probably a low threshold for your
> requirements; you will probably need to add more replicas.

Adding slaves (using SolrReplication) is fine as long as it scales
linearly. I understand that shards may not scale linearly, mostly
because of merging/network overhead, but I think they will help in
reducing response time (please correct me if I am wrong). I am more
worried about response time (even on a lightly loaded slave). The main
intention of sharding was to reduce the response time. Would it be
better to have a 2 shards x 6 slaves configuration than
6 shards x 2 slaves? Considering that my total number of docs is
12 million, will Solr be OK with 6 million docs/shard?
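
To make the two layouts concrete, here is a back-of-the-envelope
comparison in Python. It is only a sketch under stated assumptions: every
query fans out to all shards, queries are spread evenly over the slaves of
a shard, and fan-out/merge overhead is ignored. It says nothing about
actual response times, which still have to be measured.

```python
# Rough per-slave load for a sharded, replicated layout.
# ASSUMPTIONS: each query hits every shard once, load is balanced evenly
# across a shard's slaves, and coordination overhead is not modelled.

def layout_load(total_qps, total_docs, shards, slaves_per_shard):
    return {
        # Each shard sees every query, so only the slave count divides QPS.
        "qps_per_slave": total_qps / float(slaves_per_shard),
        # Documents, by contrast, are divided across shards.
        "docs_per_shard": total_docs // shards,
    }


if __name__ == "__main__":
    for shards, slaves in [(6, 2), (2, 6)]:
        print(shards, slaves, layout_load(100, 12000000, shards, slaves))
```

Both layouts use 12 slaves in total. Under these assumptions the
6 x 2 layout gives each slave a smaller index but a higher query rate,
and the 2 x 6 layout gives the reverse.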

>
>
>> 2. What would be a reasonable latency (pollInterval) on the slaves for
>> SolrReplication (all slaves are connected to a single backplane)? Is a
>> 1-minute pollInterval reasonable?
>
> Yes, but it is not reasonable to expect updates each time you poll.
> That is, you shouldn't perform commits more than once every 10
> minutes. Otherwise we would be talking about near real time indexing,
> which is in development in trunk:
> http://wiki.apache.org/solr/NearRealtimeSearch

Hmm. A 10-minute latency is definitely too high for me (especially as
this is a streaming use case, i.e. show the latest stuff first). In that
case I can probably get rid of master-slave and update all the
replicated shards directly. But then I will have to do a lot of legwork
(what if one of the slaves is down, etc.), which I was trying to avoid.
Just curious: how stable is NRT?

>
>
>> 3. Is NRT a better/viable option compared to SolrReplication ?
>
> That is something in development. AFAIK it works with shards (because
> NRT refers to indexing, and with shards there isn't anything particular
> about the indexing), but with replication something different will be
> needed. I think SolrCloud covers these NRT aspects due to its
> different architecture (not master-slave replication, but all peers
> replicating).

So it seems SolrReplication is out (if my pollInterval < 5 minutes),
right? Let me look into SolrCloud. Any suggestions on which one is
more stable, SolrCloud or NRT?

-Thanks,
Prasenjit

Re: SolrReplication configuration with frequent deletes and updates

Posted by Erick Erickson <er...@gmail.com>.
In addition to what Emmanuel mentioned, why not consider 7 shards? If
you used one shard/day, your delete problem becomes really easy,
just nuke the oldest shard....
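
The one-shard-per-day idea can be sketched roughly as follows in Python.
The "docs_YYYYMMDD" naming scheme and the helper names are hypothetical,
purely for illustration; how a shard is actually created, queried, and
dropped depends on your deployment.

```python
from datetime import date, timedelta

# ASSUMPTION: one index (shard) per calendar day, named docs_YYYYMMDD.
# The naming scheme and these helpers are hypothetical.

def shard_for(day):
    """Name of the shard that receives documents added on `day`."""
    return "docs_" + day.strftime("%Y%m%d")


def live_shards(today, retention_days=7):
    """Shards to search: today plus the previous retention_days - 1 days."""
    return [shard_for(today - timedelta(days=n)) for n in range(retention_days)]


def shard_to_drop(today, retention_days=7):
    """The shard that has just aged out and can be removed wholesale."""
    return shard_for(today - timedelta(days=retention_days))


if __name__ == "__main__":
    today = date(2012, 2, 1)
    print(live_shards(today))
    print(shard_to_drop(today))
```

Dropping a whole shard replaces the weekly delete-by-query, so no deleted
documents are left behind in segments waiting to be merged away.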

Although beware that this solution may affect your TF/IDF calculations
on the new shard (i.e. the one you use for *today's* data) until you get
enough documents on it.
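
A small illustration of why a nearly empty shard scores differently,
in Python. It assumes Lucene's classic TF/IDF similarity, where
idf = 1 + ln(numDocs / (docFreq + 1)), and that each shard computes idf
from its own local statistics. The document counts below are made up.

```python
import math

# ASSUMPTIONS: classic Lucene TF/IDF idf formula, computed per shard from
# local statistics; all counts are invented for illustration.

def classic_idf(doc_freq, num_docs):
    """idf = 1 + ln(numDocs / (docFreq + 1))"""
    return 1.0 + math.log(num_docs / float(doc_freq + 1))


if __name__ == "__main__":
    # The same term early in the day on today's shard vs. on a full shard.
    print(classic_idf(doc_freq=10, num_docs=1000))
    print(classic_idf(doc_freq=1000, num_docs=2000000))
```

Because the two values differ, the same term contributes a different
weight depending on which shard a document lives in, which can skew the
merged ranking until today's shard fills up.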

Best
Erick

Re: SolrReplication configuration with frequent deletes and updates

Posted by Emmanuel Espina <es...@gmail.com>.
2012/2/1 prasenjit mukherjee <pr...@gmail.com>:
> I have the following requirements:
>
> 1. Adds: 20 docs/sec
> 2. Searches: 100 searches/sec
> 3. Deletes: 20*3600*24*7 ~ 12 million docs/week (a cron job that
> deletes all documents more than 7 days old)
>
> I am thinking of having 6 shards (2 million docs each), each with
> 1 master and 2 slaves using SolrReplication. I have the following
> questions:
>
> 1. With 50 searches/sec per shard and 2 million docs, what response
> time should I expect? I would like to keep it under 100 ms.

That is quite a lot of searches per second, considering that every
query has to search all 6 shards (coordination and network latency
affect the results). The components you use and the complexity
of the query (as well as the number of segments in each shard) also
affect the time. 100 ms is probably a low threshold for your
requirements; you will probably need to add more replicas.


> 2. What would be a reasonable latency (pollInterval) on the slaves for
> SolrReplication (all slaves are connected to a single backplane)? Is a
> 1-minute pollInterval reasonable?

Yes, but it is not reasonable to expect updates each time you poll.
That is, you shouldn't perform commits more than once every 10
minutes. Otherwise we would be talking about near real time indexing,
which is in development in trunk:
http://wiki.apache.org/solr/NearRealtimeSearch


> 3. Is NRT a better/viable option compared to SolrReplication ?

That is something in development. AFAIK it works with shards (because
NRT refers to indexing, and with shards there isn't anything particular
about the indexing), but with replication something different will be
needed. I think SolrCloud covers these NRT aspects due to its
different architecture (not master-slave replication, but all peers
replicating).

>
> -Thanks,
> Prasenjit