Posted to user@zookeeper.apache.org by Fangmin Lv <lv...@gmail.com> on 2019/11/03 02:07:41 UTC

Re: String inconsistency issue when running ZK with OpenJDK 10 on SKL machines

Enrico,

As Andor mentioned, the issue has been fixed in JDK 11 since b27, so you
should be fine :)

Fangmin

On Mon, Oct 28, 2019 at 10:44 PM Andor Molnar <an...@apache.org> wrote:

> Here’s the JDK issue that Fangmin mentioned:
>
> https://bugs.openjdk.java.net/browse/JDK-8207746
>
> It’s a JDK 10 & 11 bug which has already been fixed since JDK11 b27.
>
> Andor
>
>
>
> > On 2019. Oct 28., at 8:00, Enrico Olivelli <eo...@gmail.com> wrote:
> >
> > Fangmin,
> >
> > On Mon, Oct 28, 2019, 02:23 Fangmin Lv <lv...@gmail.com> wrote:
> >
> >> Hey everyone,
> >>
> >> (Forgot to add a subject to the previous email; resent with a clear
> >> subject.)
> >>
> >> I'd like to share some weird inconsistency bugs we recently saw in prod,
> >> along with their root cause and potential fixes. It took us around a
> >> month to investigate, reproduce and find the root cause; hopefully the
> >> information here will help people avoid hitting the same issue.
> >>
> >> [Trigger conditions and behavior]
> >>
> >> The inconsistency only happened when running ZK with OpenJDK 10 on SKL
> >> (Skylake) machines. It is not caused by bugs inside ZK but by a
> >> macro-assembly bug inside the JDK.
> >>
> >> The symptoms can include:
> >>
> >> * NONODE returned by getData on a child that exists when queried with
> >> getChildren, although no delete was issued
> >> * NONODE returned when trying to create a child under a parent node that
> >> was just successfully created, although no delete was issued
> >> * No client is able to acquire the lock even though the previous session
> >> that held the lock is already dead
> >>
> >> [Root cause]
> >>
> >> The direct cause of the misbehavior above is that key/value entries put
> >> into the ZooKeeperServer.outstandingChangesForPath HashMap or the
> >> DataNode.children HashSet are not visible to later get or remove calls.
> >> As a result, outstanding changes are not visible when the leader
> >> prepares the following txns, or a node is deleted but not removed from
> >> DataNode.children.
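
A minimal sketch of how an unreliable equals produces this symptom at the collection level. The FlakyKey class and its misreport switch are hypothetical stand-ins for a String whose String.equals intrinsic returns the wrong answer; this is not ZooKeeper code and it does not reproduce the JIT bug itself.

```java
import java.util.HashSet;
import java.util.Set;

public class BrokenEqualsDemo {
    /** Hypothetical key type; its equals() can be forced to misreport. */
    static final class FlakyKey {
        static boolean misreport = false; // simulates the faulty intrinsic
        private final String value;

        FlakyKey(String value) { this.value = value; }

        @Override
        public int hashCode() { return value.hashCode(); } // hashing is unaffected

        @Override
        public boolean equals(Object other) {
            if (misreport) return false; // equal keys reported as different
            return other instanceof FlakyKey && value.equals(((FlakyKey) other).value);
        }
    }

    /** Adds a child, then tries to remove it through a fresh, equal key. */
    public static boolean removeSucceeds(boolean misreport) {
        Set<FlakyKey> children = new HashSet<>();
        FlakyKey.misreport = false;
        children.add(new FlakyKey("child-1"));
        FlakyKey.misreport = misreport;
        boolean removed = children.remove(new FlakyKey("child-1"));
        FlakyKey.misreport = false;
        return removed;
    }

    public static void main(String[] args) {
        System.out.println("correct equals, removed: " + removeSucceeds(false));
        System.out.println("faulty equals, removed: " + removeSucceeds(true));
    }
}
```

With the faulty comparison, the hash bucket is found but the entry is never matched, so remove returns false and the stale child stays in the set.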
> >>
> >> The 'bad' HashMap/HashSet behavior is not caused by concurrency bugs
> >> inside ZK, but by a bug in the macro-assembly used to generate the
> >> String.equals intrinsic assembly code in JDK 9 and 10. The bug was
> >> introduced in JDK-8144771, which added AVX-512 instruction support to
> >> the JDK to optimize String.equals intrinsic performance with 512-bit
> >> vector operations. Due to the bug, String.equals may return a wrong
> >> result when using the upper CPU registers (xmm16 - xmm31) with a
> >> non-empty stack on SKL machines where AVX-512 is available.
> >>
> >> The macro-assembly bug we hit is in vptest, which is used in the
> >> string_compare macro assembly code
> >> <http://hg.openjdk.java.net/jdk/jdk10/file/b09e56145e11/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4933>.
> >> It uses add/sub instructions when temporarily saving and restoring
> >> register values on the stack, which clobbers the ZF (zero flag) in the
> >> FLAGS register set by the previous test instruction.
> >>
> >> In our case, if the key exists in the DataNode.children HashSet, the
> >> test instruction result will be zero and the ZF bit will be set to 1.
> >> If the RSP value is not 0 (e.g. the stack is not empty) after the addptr
> >> code here, the ZF bit will be changed to 0, so the String.equals
> >> comparison during removeNode will wrongly return false, and the key
> >> won't be removed.
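
The flag sequence described above can be modelled in a few lines. This is a toy model only: the method names, the 64-byte stack adjustment and the sample RSP value are illustrative assumptions, and the real logic lives in the HotSpot macro assembler, not in Java.

```java
public class ZeroFlagClobberDemo {
    /** A test instruction sets ZF when its result is zero. */
    static boolean zfAfterTest(long testResult) {
        return testResult == 0;
    }

    /** An add instruction overwrites ZF based on its own result. */
    static boolean zfAfterAdd(long rsp, long delta) {
        return rsp + delta == 0;
    }

    /**
     * Returns the ZF value the caller observes, which it interprets as
     * "the compared chunks are equal".
     */
    public static boolean equalsAsSeenByCaller(long testResult, long rsp,
                                               boolean restoreStackWithAdd) {
        boolean zf = zfAfterTest(testResult);
        if (restoreStackWithAdd) {
            // Restoring the stack pointer after the temporary register spill
            // replaces the ZF produced by the test (64 is an assumed size).
            zf = zfAfterAdd(rsp, 64);
        }
        return zf;
    }

    public static void main(String[] args) {
        long rsp = 0x7ffc00001000L; // hypothetical non-zero stack pointer
        System.out.println("no stack fix-up: " + equalsAsSeenByCaller(0, rsp, false));
        System.out.println("add after test:  " + equalsAsSeenByCaller(0, rsp, true));
    }
}
```

Because a real stack pointer is never zero after the adjustment, the model always reports "not equal" on the buggy path, even when the test found the strings equal.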
> >>
> >> A bug is reported in JDK-8207746. The behavior described there is
> >> different, but we've confirmed the issue by adding assembly code to log
> >> it in JDK 10.
> >>
> >> [Solutions]
> >>
> >> The possible mitigations are:
> >>
> >> 1. Disabling AVX-512 with the JVM option -XX:UseAVX=2
> >> 2. Using an OpenJDK version higher than 10, where the issue is fixed by
> >> JDK-8207746
> >>
> >> Upgrading to OpenJDK 11+ is the better option, since 10 is not well
> >> supported and AVX-512 does help improve performance.
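
A hedged sketch of a startup check based on the two mitigations above. The class and method names are hypothetical. It treats feature versions 9 and 10 as affected, following this thread, and it only looks at the JVM arguments reported by the runtime MXBean; it does not detect whether the CPU actually supports AVX-512.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class Avx512BugCheck {
    private static final String USE_AVX = "-XX:UseAVX=";

    /** Parses java.specification.version, e.g. "1.8" -> 8, "10" -> 10. */
    public static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    /** Versions this thread describes as affected (assumption: 9 and 10). */
    public static boolean isAffectedFeatureVersion(int feature) {
        return feature == 9 || feature == 10;
    }

    /** True if the last -XX:UseAVX=N argument limits AVX to level 2 or lower. */
    public static boolean avxCappedAtTwo(List<String> jvmArgs) {
        boolean capped = false;
        for (String arg : jvmArgs) {
            if (arg.startsWith(USE_AVX)) {
                try {
                    capped = Integer.parseInt(arg.substring(USE_AVX.length()).trim()) <= 2;
                } catch (NumberFormatException e) {
                    capped = false; // unparsable value: assume not capped
                }
            }
        }
        return capped;
    }

    public static void main(String[] args) {
        int feature = featureVersion(System.getProperty("java.specification.version"));
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (isAffectedFeatureVersion(feature) && !avxCappedAtTwo(jvmArgs)) {
            System.err.println("WARNING: JDK " + feature
                    + " without -XX:UseAVX=2 may hit the String.equals bug on AVX-512 hardware");
        } else {
            System.out.println("No action suggested for JDK " + feature);
        }
    }
}
```

A check like this could run once at server startup so that an affected deployment is flagged in the logs rather than discovered through inconsistent reads.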
> >>
> >> We use JDK 10 because of the SSL quorum socket close stall issue
> >> mentioned in
> >> ZOOKEEPER-3384 <https://issues.apache.org/jira/browse/ZOOKEEPER-3384>,
> >> and the SO_LINGER option is not honored in JDK 11. We've unblocked JDK 11
> >> by asynchronously closing the quorum socket, and we're upstreaming that in
> >> ZOOKEEPER-3574 <https://issues.apache.org/jira/browse/ZOOKEEPER-3574>.
> >>
> >> Thanks,
> >> Fangmin
> >>
> >
> >
> > Thank you for sharing this.
> > Do you have any pointers to the JDK 11 bugs? Are they solved in 12+?
> >
> > I am running with JDK 11-13 but without SSL, so I have never seen problems.
> >
> > Enrico
> >
> >>
>
>