You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2018/08/31 07:37:04 UTC

TimSort bug

“As a byproduct of our study, we uncover a bug in the Java implementation
that can cause the sorting method to fail during the execution.”

http://drops.dagstuhl.de/opus/volltexte/2018/9467/

This might impact Spark since we took the Java based TimSort
implementation. I have seen in the wild TimSort failing in the past. Maybe
this is the cause.

Re: TimSort bug

Posted by Reynold Xin <rx...@databricks.com>.
Thanks for looking into this, Sean! Loved the tl;dr.


On Fri, Aug 31, 2018 at 12:28 PM Sean Owen <sr...@gmail.com> wrote:

> TL;DR - We already had the fix from SPARK-5984. The delta from the current
> JDK implementation to Spark's looks actually inconsequential. No action
> required AFAICT.
>
> On Fri, Aug 31, 2018 at 12:30 PM Sean Owen <sr...@gmail.com> wrote:
>
>> I looked into this, because it sure sounds like a similar issue from a
>> few years ago that was fixed in
>> https://issues.apache.org/jira/browse/SPARK-5984
>> The change in that JIRA actually looks almost identical to the change
>> mentioned in the JDK bug:
>> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239
>>
>> Reading the paper
>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
>> section 5 a little more, I think they are saying that there were two ways
>> to fix the original problem: a) fix the invariant, or b) increase some data
>> structure size. Java did the latter, it seems, and now they've shown it's
>> actually still busted. However Python and the original paper did the
>> former, and we seem to have copied that fix from
>> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ My
>> understanding is that this still works, and is what Java *now* implements.
>>
>> The only difference I can see in implementation is an extra check for a
>> negative array index before dereferencing array[n]. We can add that for
>> full consistency with the JVM change, I suppose. I don't think it's
>> relevant to the problem reported in the paper, but could be an issue
>> otherwise. The JVM implementation suggests it thinks this needs to be
>> guarded.
>>
>> I did hack together a crude translation of the paper's bug reproduction
>> at http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some
>> Spark test code. It does need a huge amount of memory to run (>32g.. ended
>> up at 44g) so not so feasible to put in the test suite. Running it on Spark
>> master nets a JVM crash:
>>
>> # Problematic frame:
>> # J 10195 C2
>> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
>> (198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342]
>>
>> Thats... not good, but I can't tell if it's really due to this same issue
>> or something else going on in the off-heap code. Making the tiny change I
>> mentioned above doesn't do anything.
>>
>> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> “As a byproduct of our study, we uncover a bug in the Java
>>> implementation that can cause the sorting method to fail during the
>>> execution.”
>>>
>>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>>>
>>> This might impact Spark since we took the Java based TimSort
>>> implementation. I have seen in the wild TimSort failing in the past. Maybe
>>> this is the cause.
>>>
>>

Re: TimSort bug

Posted by Sean Owen <sr...@gmail.com>.
TL;DR - We already had the fix from SPARK-5984. The delta from the current
JDK implementation to Spark's looks actually inconsequential. No action
required AFAICT.

On Fri, Aug 31, 2018 at 12:30 PM Sean Owen <sr...@gmail.com> wrote:

> I looked into this, because it sure sounds like a similar issue from a few
> years ago that was fixed in
> https://issues.apache.org/jira/browse/SPARK-5984
> The change in that JIRA actually looks almost identical to the change
> mentioned in the JDK bug:
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239
>
> Reading the paper
> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
> section 5 a little more, I think they are saying that there were two ways
> to fix the original problem: a) fix the invariant, or b) increase some data
> structure size. Java did the latter, it seems, and now they've shown it's
> actually still busted. However Python and the original paper did the
> former, and we seem to have copied that fix from
> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ My
> understanding is that this still works, and is what Java *now* implements.
>
> The only difference I can see in implementation is an extra check for a
> negative array index before dereferencing array[n]. We can add that for
> full consistency with the JVM change, I suppose. I don't think it's
> relevant to the problem reported in the paper, but could be an issue
> otherwise. The JVM implementation suggests it thinks this needs to be
> guarded.
>
> I did hack together a crude translation of the paper's bug reproduction at
> http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark
> test code. It does need a huge amount of memory to run (>32g.. ended up at
> 44g) so not so feasible to put in the test suite. Running it on Spark
> master nets a JVM crash:
>
> # Problematic frame:
> # J 10195 C2
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
> (198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342]
>
> Thats... not good, but I can't tell if it's really due to this same issue
> or something else going on in the off-heap code. Making the tiny change I
> mentioned above doesn't do anything.
>
> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <rx...@databricks.com> wrote:
>
>> “As a byproduct of our study, we uncover a bug in the Java
>> implementation that can cause the sorting method to fail during the
>> execution.”
>>
>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>>
>> This might impact Spark since we took the Java based TimSort
>> implementation. I have seen in the wild TimSort failing in the past. Maybe
>> this is the cause.
>>
>

Re: TimSort bug

Posted by Sean Owen <sr...@gmail.com>.
I looked into this, because it sure sounds like a similar issue from a few
years ago that was fixed in https://issues.apache.org/jira/browse/SPARK-5984

The change in that JIRA actually looks almost identical to the change
mentioned in the JDK bug:
http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239

Reading the paper
http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
section 5 a little more, I think they are saying that there were two ways
to fix the original problem: a) fix the invariant, or b) increase some data
structure size. Java did the latter, it seems, and now they've shown it's
actually still busted. However Python and the original paper did the
former, and we seem to have copied that fix from
http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
My
understanding is that this still works, and is what Java *now* implements.

The only difference I can see in implementation is an extra check for a
negative array index before dereferencing array[n]. We can add that for
full consistency with the JVM change, I suppose. I don't think it's
relevant to the problem reported in the paper, but could be an issue
otherwise. The JVM implementation suggests it thinks this needs to be
guarded.

I did hack together a crude translation of the paper's bug reproduction at
http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark
test code. It does need a huge amount of memory to run (>32g.. ended up at
44g) so not so feasible to put in the test suite. Running it on Spark
master nets a JVM crash:

# Problematic frame:
# J 10195 C2
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
(198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342]

Thats... not good, but I can't tell if it's really due to this same issue
or something else going on in the off-heap code. Making the tiny change I
mentioned above doesn't do anything.

On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <rx...@databricks.com> wrote:

> “As a byproduct of our study, we uncover a bug in the Java implementation
> that can cause the sorting method to fail during the execution.”
>
> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>
> This might impact Spark since we took the Java based TimSort
> implementation. I have seen in the wild TimSort failing in the past. Maybe
> this is the cause.
>