Posted to dev@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2018/05/30 18:52:29 UTC

Solr Star Burst - SolrCloud Performance / Scale

I've always said I wanted to focus on performance and scale for SolrCloud,
but for a long time that really just involved focusing on stability.

Now things have started to get pretty stable. Some things that made me
cringe about SolrCloud no longer do in 7.3/7.4.

Weeks back I found myself yet again chasing spurious, ugly issues around
fragile connections that cause recovery headaches and random request
failures. Again I made a change that should bring big improvements. Like
many times before.

I've had just about enough of that. Just about enough of broken connection
reuse. Just about enough of countless wasteful threads and connections
lurking and creaking all over. Just about enough of poor single update
performance and weaknesses in batch updates. Just about enough of the
painful ConcurrentUpdateSolrClient.

So much inefficiency hiding in plain sight. Stuff I always thought we would
overcome, but always far enough in the distance to keep me from feeling bad
that I didn't know quite how we would get there. Solr was a container
agnostic web application before Solr 5, for god's sake. Even a relatively
simple change like upgrading our http client from version 3 to 4 was a
huge amount of work for very incremental improvements.

If I'm going to be excited about this system after all these years, all of
that has to change.

I started looking into using HTTP/2 and a new HttpClient that can do
non-blocking, async IO requests.

I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and
difficult. Going to a fully different client has made me reconsider that. I
did a lot of the work, but a good amount remains (security, finish SSL,
tuning ...).

I wrote a new Http2SolrClient that can replace HttpSolrClient and plug into
CloudSolrClient and LBHttpSolrClient. I added some early async APIs.
Non-blocking async IO is about as oversold as "schemaless", but it's a
great tool to have available as well.
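As a rough sketch of what that async, non-blocking request pattern looks like — using the JDK's built-in java.net.http client purely for illustration, not the Jetty HttpClient the branch actually uses, and with a hypothetical Solr URL:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncSketch {

    // Fires a request without parking a thread on the socket; the returned
    // future completes when the response (or a failure) arrives.
    static String fetch(HttpClient client, String url) {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .build();
        return client.sendAsync(req, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body)
                .exceptionally(t -> "request failed: " + t.getMessage())
                .join();
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newFixedThreadPool(2);
        // One shared client, preferring HTTP/2 (falling back to HTTP/1.1), so
        // many in-flight requests can multiplex over very few connections.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .executor(exec)
                .build();
        // Hypothetical endpoint; unreachable here, so the fallback path runs.
        System.out.println(fetch(client, "http://localhost:1/solr/admin/info/system"));
        exec.shutdown();
    }
}
```

The point of the pattern is that a caller can issue many requests and compose continuations on the futures instead of tying up a thread per outstanding request.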

I'm now working in a much more efficient world, aiming for 1 connection per
CoreContainer per remote destination. Connections are no longer fragile.
The transfer protocol is no longer text based.

Yonik should be pleased with the state of reordered updates from leader to
replica.

I replaced our CUSC usage for distributing updates with Http2SolrClient and
async calls.

I played with optionally using the async calls in the HttpShardHandler as
well.

I replaced all HttpSolrClient usage with Http2SolrClient.

I started to get control of threads. I had control of connections.

I added early efficient external request throttling.
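One common way to throttle external requests — sketched here as an illustration of the idea only, not the branch's actual code — is a semaphore that caps concurrency and sheds the overflow after a short wait:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustrative throttle: at most maxConcurrent requests run at once; a
// request that cannot get a permit within waitMs is rejected (a servlet
// filter would answer it with 503 Service Unavailable).
public class Throttle {
    private final Semaphore permits;
    private final long waitMs;

    public Throttle(int maxConcurrent, long waitMs) {
        this.permits = new Semaphore(maxConcurrent, true); // fair: FIFO under load
        this.waitMs = waitMs;
    }

    public boolean tryRun(Runnable work) throws InterruptedException {
        if (!permits.tryAcquire(waitMs, TimeUnit.MILLISECONDS)) {
            return false; // over capacity: shed the request instead of queuing forever
        }
        try {
            work.run();
            return true;
        } finally {
            permits.release();
        }
    }
}
```

Bounding the wait is what makes this "shedding" rather than queuing: under sustained overload, excess requests fail fast instead of piling up threads.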

I started tuning resource pools.

I started removing sleep polling loops. They are horrible, and they slow
tests down especially; we already have a replacement that we are hardly
using.

I did some other related stuff. I'm just fixing the main things I hate
along these communication/resource-usage/scale/perf themes.

I'm calling this whole effort Star Burst:
https://github.com/markrmiller/starburst

I've done a ton, mostly very late at night. It's not all perfect yet, and
some of it may be exploratory. There is a lot to do to wrap it up with a
bow. This touches a lot of spots; our surface area of features is just huge
now.

Basically I have a high performance Solr fork at the moment (only set up
for tests, not actually running standalone Solr). I don't know how or when
(or, to be completely honest, if) it comes home. I'm going to do what I
can, but it's likely to require more than me to be successful in a
reasonable time frame.

I have a couple JIRA issues open for HTTP/2 and the new SolrClient.

Mark


-- 
- Mark
about.me/markrmiller

Re: Solr Star Burst - SolrCloud Performance / Scale

Posted by Mark Miller <ma...@gmail.com>.
>
> If I just very slowly put it in piece by piece and tried to pre-think out
> every step, the results would be pretty dreary.
>

To elaborate on that, there probably would not have been results from me :)

I almost quit in the middle of the Jetty HttpClient work. I relearned, six
times over, every mistake I made trying to do the proxy the first time, and
then made some new ones. The security and SSL parts are still going to take
some grunt work.

I almost quit in the middle of Http2. I hadn't signed up for this. But I
was in too far by then, too much invested.

By the time I got to the QOSFilter, it was a nice change of pace, but it's
just an early prototype.

It's one of those things that just doesn't happen until some idiot bites
off more than he can chew. Painful to break up much initially, too general
to pull in lots of paid devs, too much for one dev.

I've been hunting down thread pools and bad resource use in general as well
(still clearing out sleeps, focusing on non test code first, but some test
code too). I'd like to get that in shape and then start enforcing checks
and tests around it. A lot of that can probably come in independently.

- Mark


-- 
- Mark
about.me/markrmiller

Re: Solr Star Burst - SolrCloud Performance / Scale

Posted by Mark Miller <ma...@gmail.com>.
On Wed, May 30, 2018 at 10:18 PM Varun Thacker <va...@vthacker.in> wrote:

> Hi Mark,
>
> I've started glancing at the repo and some of the issues you are
> addressing here will make things a lot more stable under high loads. I'll
> look at it in a little more detail in the coming days.
>
> The key would be how to isolate the work into discrete chunks to then go
> and make Jiras for. SOLR-12405 is the first thing that caught my eye that's
> an isolated jira and can be tackled without the http2 client etc.
>

Yeah, anything that does not depend on the Jetty HttpClient or HTTP/2 can
likely be brought in independently.

The Http2SolrClient can also come in without HTTP/2 or replacing
HttpSolrClient and still offer non-blocking async IO as a new
HTTP/1.1-capable user client.

I guess I have maybe 3 JIRA issues filed - Http2SolrClient w/ Jetty
HttpClient, HTTP/2, QOSFilter. That covers the foundation.

As I have gained access to these features though, all of a sudden it
becomes easier to debug and solve other issues. I also learn and discover
by pushing down the road. If I just very slowly put it in piece by piece
and tried to pre-think out every step, the results would be pretty dreary.
I would not be anywhere near the current state or have the same
understanding of what still needs to be done. Like SolrCloud originally,
the scope of change is just too large for standard procedure. We had to
fork that too and the merge back was huge and scary, but also would have
only been on master.

So I'll do what I can to keep the branch up to date, and we will have to
pull off bite-sized pieces, with both HTTP/2 and Jetty HttpClient just
being big and invasive no matter what, but almost all for the better :)

As soon as anyone is ready to collaborate concretely on code, let me know
and I'll finish getting a base set of tests passing and move the branch to
Apache.

- Mark
-- 
- Mark
about.me/markrmiller

Re: Solr Star Burst - SolrCloud Performance / Scale

Posted by Varun Thacker <va...@vthacker.in>.
Hi Mark,

I've started glancing at the repo and some of the issues you are
addressing here will make things a lot more stable under high loads. I'll
look at it in a little more detail in the coming days.

The key would be how to isolate the work into discrete chunks to then go
and make Jiras for. SOLR-12405 is the first thing that caught my eye that's
an isolated jira and can be tackled without the http2 client etc.


Re: Solr Star Burst - SolrCloud Performance / Scale

Posted by Mark Miller <ma...@gmail.com>.
Some of the fallout of this should be huge improvements to our tests. Right
now, some of them take so long because no one even notices when they have
done things to make the situation even worse, and it's hard to monitor
resource usage as we develop when it is already fairly unbounded.

On master right now, on a lucky run (no tlog replica type for sure),
BasicDistributedZkTest takes 76 seconds on my 6-core machine from 2012.
Depending on how hard test injection hits, I've seen a few minutes and
anywhere in between.

Setting the tlog replica issue aside (I've disabled it for the moment, but
I have fixed that issue by changing how distrib commits work), on the
starburst branch, resource usage with multiple parallel tests running is
going to be much, much better. For single cloud tests, performance is
mostly about removing naive polling and carefree resource usage. The branch
has big improvements for single and parallel tests already.

I don't know how much there is left to fix, but already, on starburst,
BasicDistributedZkTest takes 45 seconds vs. master's best case of 76.

- Mark

-- 
- Mark
about.me/markrmiller