Posted to solr-user@lucene.apache.org by Alejandro Marqués Rodríguez <am...@paradigmatecnologico.com> on 2013/11/12 14:44:22 UTC
SolrCloud recovery issue during search stress test
Hi,
We've been experiencing some problems during search stress tests, and we
have no idea why this is happening.
We have the following:
- 3 servers
- WebSphere 7
- Zookeeper 3.4.5 on each server
- Solr 4.5.0 on each server
- 1 shard (so it is one leader and 2 replicas)
- The index contains 7M documents (About 2GB)
We've run several stress tests with JMeter using 100-500 concurrent threads.
Depending on the number of threads we see different outcomes, but apart from
the timings and whether the system fully recovers or not, the steps are always
the following:
1. The Solr instances begin responding to queries, with a stable number of
threads on each instance (fewer than 10).
2. Once the test has been running for several minutes, we kill one of the
instances (most of the time the leader).
3. The remaining instances keep responding to queries, with a slight increase
in the number of threads used.
4. After a few minutes we restart the killed instance (and this is where our
problem starts).
5. Once it starts, its number of threads in use begins to climb (up to 100 or
more) and, worst of all, even the other two instances start responding slowly
(or not at all). Then, depending on the number of concurrent queries: if there
are few, everything goes back to normal in about 3 minutes (though almost no
queries are served during that period); if there are more than 200 concurrent
queries, the restarted server's thread count grows so much that it crashes.
During the minutes in which the three instances are not responding, nothing is
logged, and after taking a thread dump we saw a lot of stalled threads with
sun.misc.Unsafe.park in their stack traces.
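
To put a number on "a lot of stalled threads", a saved thread dump can be
summarised with a short script. The following is only a minimal sketch, not
part of the original test setup: it assumes a HotSpot jstack-style text dump
in which each thread block starts with a quoted thread name and blocks are
separated by blank lines (other JVMs use different dump formats and would need
a different parser). The function name count_parked_threads is hypothetical.

```python
from collections import Counter


def count_parked_threads(dump_text):
    """Count threads parked in sun.misc.Unsafe.park, grouped by thread-name prefix.

    Assumes a jstack-style dump: one block per thread, blocks separated by
    blank lines, first line of each block starting with the quoted thread name.
    """
    counts = Counter()
    for block in dump_text.split("\n\n"):
        lines = block.strip().splitlines()
        # Skip anything that is not a thread block (headers, trailing text).
        if not lines or not lines[0].startswith('"'):
            continue
        if any("sun.misc.Unsafe.park" in line for line in lines[1:]):
            name = lines[0].split('"')[1]
            # Drop trailing counters so threads of the same pool group together.
            prefix = name.rstrip("0123456789-#: ")
            counts[prefix] += 1
    return counts
```

Calling count_parked_threads(open("dump.txt").read()).most_common(5), with
dump.txt being a hypothetical file name, would list the thread pools with the
most parked threads, which may hint at whether the stalled threads are request
handlers or internal pool workers.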
I don't understand this behaviour at all: not only does the cluster work better
with two instances than after restarting the third, but the restart also
affects the behaviour of the two remaining instances...
Does anybody have any clue about this?
Thanks in advance
--
Alejandro Marqués Rodríguez
Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
Re: SolrCloud recovery issue during search stress test
Posted by Erick Erickson <er...@gmail.com>.
Check your Solr transaction log size. It's possible that your
killed Solr is replaying transaction logs, or syncing from the
current leader (perhaps by replicating the entire shard index).
This is usually the case when you're getting updates while
killing the leader.
Here's a writeup on tlogs etc. and how to control this.
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
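
One way to check the transaction log size is to add up the files in the tlog
directory. This is a minimal sketch under stated assumptions: that the logs
sit in a directory of files whose names start with "tlog." (typically a tlog
folder under the core's data directory, though the location depends on your
configuration). The function name tlog_size_bytes and the example path are
hypothetical.

```python
import os


def tlog_size_bytes(tlog_dir):
    """Return the combined size in bytes of the tlog.* files in tlog_dir."""
    total = 0
    for entry in os.scandir(tlog_dir):
        # Only count regular files that look like transaction logs.
        if entry.is_file() and entry.name.startswith("tlog."):
            total += entry.stat().st_size
    return total
```

For example, tlog_size_bytes("/path/to/core/data/tlog") / 1e6 would give the
size in megabytes for that (hypothetical) path. A very large value would
suggest that a restarted node has a lot to replay.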
Best,
Erick