Posted to solr-user@lucene.apache.org by Shawn Heisey <ap...@elyograg.org> on 2017/05/01 14:22:01 UTC

Re: Solr performance on EC2 linux

On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> tldr: Recently, I tried moving an existing solrcloud configuration from a local datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I applied a bunch of linux tweaks.

How very strange.  I knew virtualization would have overhead, possibly
even measurable overhead, but that's insane.  Running on bare metal is
always better if you can do it.  I would be curious what would happen on
your original install if you applied similar tuning to that.  Would you
see a speedup there?

> Interestingly, a coworker playing with an Elasticsearch (ES 5.x, so a much more recent release) alternate implementation of the same index was not seeing this high-system-time behavior on EC2, and was getting throughput consistent with our general expectations.

That's even weirder.  ES 5.x will likely be using Points field types for
numeric fields, and although those are faster than what Solr currently
uses, I doubt it could explain that difference.  The implication here is
that the ES systems are running with stock EC2 settings, not the tuned
settings ... but I'd like you to confirm that.  Same Java version as
with Solr?  IMHO, Java itself is more likely to cause issues like you
saw than Solr.

> I\u2019m writing this for a few reasons:
>
> 1.       The performance difference was so crazy I really feel like this should be broader knowledge.

Definitely agree!  I would be very interested in learning which of the
tunables you changed were major contributors to the improvement.  If it
turns out that Solr's code is sub-optimal in some way, maybe we can fix it.

> 2.       If anyone is aware of anything that changed in Lucene between 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If it’s the clocksource that’s the issue, there’s an implication that Solr was using tons more system calls like gettimeofday that the EC2 (xen) hypervisor doesn’t allow in userspace.

I had not considered the performance regression in 6.4.0 and 6.4.1 that
Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?
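
For anyone wanting to test the clocksource theory above, here is a minimal
Java sketch (purely illustrative, not from the original mail; the sysfs path
and call count are just examples).  It prints the kernel's current
clocksource and times a burst of System.nanoTime() calls, which is roughly
what Lucene/Solr do when timing requests.  On a Xen guest stuck on the "xen"
clocksource, each such call can drop into the kernel instead of using the
vDSO, which would show up as high system time.

import java.nio.file.Files;
import java.nio.file.Paths;

public class ClockSourceCheck {
    public static void main(String[] args) throws Exception {
        // Current kernel clocksource, e.g. "xen" vs "tsc"
        String path = "/sys/devices/system/clocksource/clocksource0/current_clocksource";
        String source = new String(Files.readAllBytes(Paths.get(path))).trim();
        System.out.println("current_clocksource = " + source);

        // Time a burst of nanoTime() calls; a per-call cost in the hundreds
        // of nanoseconds suggests each call is a real syscall.
        final int calls = 10_000_000;
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime();
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d nanoTime() calls took %d ms (~%d ns/call, sink=%d)%n",
                calls, elapsed / 1_000_000, elapsed / calls, sink);
    }
}
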

=============

Specific thoughts on the tuning:

The noatime option is very good to use.  I also use nodiratime on my
systems.  Turning atime and diratime updates off can have *massive*
positive impacts on disk performance.  If these are the source of the
speedup, then the machine doesn't have enough spare memory: with enough
free RAM the OS caches the index and rarely touches the disk, so the
atime writes wouldn't matter much.

I'd be wary of the "nobarrier" mount option.  If the underlying storage
has battery-backed write caches, or is SSD without write caching, it
wouldn't be a problem.  Here's info about the "discard" mount option; I
don't know whether it applies to your Amazon storage:

       discard/nodiscard
              Controls whether ext4 should issue discard/TRIM commands to
              the underlying block device when blocks are freed.  This is
              useful for SSD devices and sparse/thinly-provisioned LUNs,
              but it is off by default until sufficient testing has been
              done.
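
If it helps to check what is actually in effect on a node, here is a small
Java sketch (illustrative only; the /proc/mounts path and the ext4 filter are
assumptions, not something from the thread).  It prints the mount options per
ext4 filesystem so noatime/nodiratime, nobarrier, and discard are easy to spot:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MountOptions {
    public static void main(String[] args) throws Exception {
        // Each /proc/mounts line: device, mountpoint, fstype, options, dump, pass
        List<String> lines = Files.readAllLines(Paths.get("/proc/mounts"));
        for (String line : lines) {
            String[] f = line.split("\\s+");
            if (f.length >= 4 && f[2].equals("ext4")) {
                System.out.printf("%-30s %s%n", f[1], f[3]);
            }
        }
    }
}
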

The network tunables would have more of an effect in a distributed
environment like EC2 than they would on a LAN.
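
The thread doesn't say which network tunables were on that tuning page, so
the sysctl names below are only common examples I'm assuming for
illustration.  A small Java sketch that dumps a few of them by reading
/proc/sys:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NetSysctls {
    // Commonly-tuned network sysctls; purely illustrative, not the actual list.
    private static final String[] KEYS = {
        "net/core/somaxconn",
        "net/core/rmem_max",
        "net/core/wmem_max",
        "net/ipv4/tcp_rmem",
        "net/ipv4/tcp_wmem",
    };

    public static void main(String[] args) throws Exception {
        for (String key : KEYS) {
            Path p = Paths.get("/proc/sys", key);
            String value = Files.exists(p)
                    ? new String(Files.readAllBytes(p)).trim()
                    : "(not present)";
            System.out.printf("%-25s = %s%n", key.replace('/', '.'), value);
        }
    }
}
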

Thanks,
Shawn


Re: Solr performance on EC2 linux

Posted by Jeff Wartes <jw...@whitepages.com>.
Yes, that’s the Xenial I tried. Ubuntu 16.04.2 LTS.

On 5/1/17, 7:22 PM, "Will Martin" <wm...@outlook.com> wrote:

    Ubuntu 16.04 LTS - Xenial (HVM)
    
    Is this your Xenial version?
    
    
    
    


Re: Solr performance on EC2 linux

Posted by Will Martin <wm...@outlook.com>.
Ubuntu 16.04 LTS - Xenial (HVM)

Is this your Xenial version?






Re: Solr performance on EC2 linux

Posted by Jeff Wartes <jw...@whitepages.com>.
I tried a few variations of various things before we found and tried that linux/EC2 tuning page, including:
  - EC2 instance type: r4, c4, and i3
  - Ubuntu version: Xenial and Trusty
  - EBS vs local storage
  - Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of the issues with early java8 versions and I’m not using G1)

Most of those attempts were to help reduce differences between the data center and the EC2 cluster. In all cases I re-indexed from scratch. I got the same very high system-time symptom in all cases. With the linux changes in place, we settled on r4/Xenial/EBS/Stock.

Again, this was a slightly modified Solr 5.4 (I added backup requests, and two memory allocation rate tweaks that have long since been merged into mainline - released in 6.2 I think; I can dig up the jira numbers if anyone’s interested). I’ve never used Solr 6.x in production though.
The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his ES setup, although I think he did try G1.

I definitely do want to binary-search those settings until I understand better what exactly did the trick.
The problem is the long cycle time per test, but hopefully I’ll get to it in the next couple of weeks.
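
As a rough illustration of that bisection idea (a toy Java sketch with assumed
names, not code from this thread): given an ordered list of tuning changes and
a predicate that rebuilds a test cluster with a subset applied and reports
whether it is "fast", you can recurse on halves to isolate a single decisive
change.

import java.util.List;
import java.util.function.Predicate;

public class SettingBisect {
    // Assumes the full list is "fast" and a single setting dominates;
    // interacting settings would need a fuller delta-debugging approach.
    static String findCriticalSetting(List<String> settings, Predicate<List<String>> isFast) {
        if (settings.size() == 1) {
            return settings.get(0);
        }
        int mid = settings.size() / 2;
        List<String> firstHalf = settings.subList(0, mid);
        // If the first half alone is already fast, the decisive change is in it;
        // otherwise it must be in the second half.
        if (isFast.test(firstHalf)) {
            return findCriticalSetting(firstHalf, isFast);
        }
        return findCriticalSetting(settings.subList(mid, settings.size()), isFast);
    }
}

In practice the isFast predicate would wrap a full re-index and benchmark run,
which is exactly why each test cycle takes so long.
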



On 5/1/17, 7:26 AM, "John Bickerstaff" <jo...@johnbickerstaff.com> wrote:

    It's also very important to consider the type of EC2 instance you are
    using...
    
    We settled on the R4.2XL...  The R series is labeled "High-Memory"
    
    Which instance type did you end up using?
    


Re: Solr performance on EC2 linux

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
It's also very important to consider the type of EC2 instance you are
using...

We settled on the R4.2XL...  The R series is labeled "High-Memory"

Which instance type did you end up using?
