Posted to solr-user@lucene.apache.org by Alejandro Marqués Rodríguez <am...@paradigmatecnologico.com> on 2013/11/12 14:44:22 UTC

SolrCloud recovery issue during search stress test

Hi,

We've been experiencing some problems during search stress tests and we
have no idea why this is happening.

We have the following:
- 3 servers
- WebSphere 7
- ZooKeeper 3.4.5 on each server
- Solr 4.5.0 on each server
- 1 shard (so it is one leader and 2 replicas)
- The index contains 7M documents (About 2GB)

We've run several stress tests with JMeter using 100-500 concurrent threads.
Depending on the number of threads we see different scenarios, but apart from
the timings and whether the system fully recovers or not, the steps are always:


   1. The Solr instances begin answering queries, with a stable number of
   threads on each instance (fewer than 10).
   2. Once the test has been running for several minutes, we kill one of the
   Solr instances (most of the time, the leader).
   3. The remaining instances keep answering queries, with a slight increase
   in the number of threads used.
   4. After a few minutes we restart the killed instance (and this is where
   our problem starts).
   5. Once it starts, its thread count climbs (up to 100 or more) and, worst
   of all, even the other two instances start responding slowly (or not at
   all). Then, depending on the number of concurrent queries: with few
   queries, everything goes back to normal after roughly 3 minutes (though
   almost no queries are served during that period); with more than 200
   concurrent queries, the restarted server's thread count grows so much
   that it crashes.

During the minutes when the three Solr instances are not responding nothing
is logged, and a thread dump shows a lot of stalled threads with
sun.misc.Unsafe.park in their stack traces.
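
To see what the parked threads are actually waiting on, a small script can group them by the first non-JDK frame in each stack. This is only a sketch: the function name is made up for illustration, and it assumes a HotSpot/jstack-style text dump (thread entries separated by blank lines, frames written as "at pkg.Class.method(...)"). IBM javacore files, which a WebSphere JVM may produce, use a different layout and would need different parsing.

```python
import re
from collections import Counter


def summarize_parked_threads(dump_text, marker="sun.misc.Unsafe.park"):
    """Count threads whose stack mentions `marker`, keyed by the first
    frame that does not belong to a sun.* or java.* package."""
    # Assumption: thread entries are separated by blank lines.
    blocks = re.split(r"\n\s*\n", dump_text)
    callers = Counter()
    for block in blocks:
        if marker not in block:
            continue
        # Collect "at pkg.Class.method(" frames in top-to-bottom order.
        frames = re.findall(r"^\s*at\s+([\w.$<>]+)\(", block, flags=re.MULTILINE)
        caller = next(
            (f for f in frames if not f.startswith(("sun.", "java."))),
            "(JDK frames only)",
        )
        callers[caller] += 1
    return callers
```

Calling summarize_parked_threads(open("dump.txt").read()).most_common(10) would list the ten most frequent waiting call sites ("dump.txt" being whatever file the dump was saved to). If most of them point at the same pool or lock, that is a likely place to look.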

I don't understand this behaviour at all: not only does the cluster work
better with two Solr instances than after restarting the third, but the
restart also affects the behaviour of the two remaining instances...

Does anybody have any clue about this?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

Re: SolrCloud recovery issue during search stress test

Posted by Erick Erickson <er...@gmail.com>.

Check your Solr transaction log size. It's possible that your
killed Solr is replaying transaction logs when it restarts, or syncing
from the current leader (perhaps by replicating the entire shard index).

This usually happens when you're receiving updates while
killing the leader.
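
As a rough way to check the transaction log size, something like the following could be run on each node. It is a sketch, not an official Solr tool: the function name is made up, and it assumes you pass it the core's tlog directory (normally the "tlog" folder under the core's data directory; the exact path depends on your installation).

```python
import os


def tlog_size_bytes(tlog_dir):
    """Return (file_count, total_bytes) for the regular files directly
    inside tlog_dir. The directory path is supplied by the caller."""
    count = 0
    total = 0
    with os.scandir(tlog_dir) as entries:
        for entry in entries:
            # Subdirectories are skipped; only plain files are counted.
            if entry.is_file():
                count += 1
                total += entry.stat().st_size
    return count, total
```

For example, count, size = tlog_size_bytes("/path/to/core/data/tlog") followed by print(count, size / 2**20) shows the number of log files and their total size in MiB (the path here is a placeholder). A total that keeps growing between hard commits would fit the replay explanation.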

Here's a writeup on tlogs etc. and how to control this.

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick

