Posted to issues@spark.apache.org by "kondziolka9ld (Jira)" <ji...@apache.org> on 2021/04/15 09:59:00 UTC
[jira] [Commented] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
[ https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322050#comment-17322050 ]
kondziolka9ld commented on SPARK-34792:
---------------------------------------
I found the root cause of this difference: it is related to the https://issues.apache.org/jira/browse/SPARK-23643 change.
Obviously, the input partitioning of the dataframe has an impact on the results as well.
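To see why the input partitioning matters, here is a minimal, hypothetical simulation. Spark samples each partition with its own RNG seeded from the user seed plus the partition index, so the same rows grouped into different partitions generally fall on different sides of the split. The `splitFirst` helper below is illustrative only, not Spark's actual sampler classes.

```scala
import scala.util.Random

// Simplified model of per-partition Bernoulli sampling: each partition gets an
// RNG seeded with (seed + partitionIndex); an element lands in the first split
// when nextDouble() < fraction. Illustrative only, not Spark's real code path.
def splitFirst(partitions: Seq[Seq[Int]], fraction: Double, seed: Long): Seq[Int] =
  partitions.zipWithIndex.flatMap { case (part, idx) =>
    val rng = new Random(seed + idx)  // per-partition seed, as Spark does
    part.filter(_ => rng.nextDouble() < fraction)
  }

val data = (1 to 10)
val twoParts  = Seq(data.take(5), data.drop(5))  // same data in 2 partitions
val fiveParts = data.grouped(2).toSeq            // same data in 5 partitions

// Same data, same seed, different partitioning -> generally different splits.
println(splitFirst(twoParts, 0.3, 42L))
println(splitFirst(fiveParts, 0.3, 42L))
```

This is why any comparison between spark-2.4.7 and spark-3 only makes sense once both sides start from identical partitioning.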
When we ensure that the input partitioning on spark-2.4.7 and spark-3 is the same, and when we revert the change in `core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala` from https://github.com/apache/spark/pull/20793/, we get consistent results between both versions.
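For context, a sketch of what that patch changed, paraphrased rather than copied from the Spark source: before SPARK-23643, `hashSeed` allocated a buffer of `java.lang.Long.SIZE` (64) bytes, so MurmurHash3 hashed the 8 seed bytes plus 56 trailing zeros; the fix shrank the buffer to `java.lang.Long.BYTES` (8), which changes every hashed seed and therefore every random stream derived from it.

```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Paraphrase of XORShiftRandom.hashSeed, parameterized by buffer size so the
// pre- and post-SPARK-23643 behaviour can be compared side by side.
def hashSeed(seed: Long, bufferSize: Int): Long = {
  val bytes = ByteBuffer.allocate(bufferSize).putLong(seed).array()
  val lowBits = MurmurHash3.bytesHash(bytes)
  val highBits = MurmurHash3.bytesHash(bytes, lowBits)
  (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
}

val before = hashSeed(42L, java.lang.Long.SIZE)   // 64-byte buffer (old, buggy)
val after  = hashSeed(42L, java.lang.Long.BYTES)  // 8-byte buffer (fixed)
println(before == after)  // the two versions hash the same seed differently
```

Since `randomSplit` ultimately draws from an `XORShiftRandom` initialized via this hash, the differently hashed seed alone is enough to produce the divergence reported below, even with identical partitioning.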
> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> ---------------------------------------------------------------------
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
> Issue Type: Question
> Components: Spark Core, SQL
> Affects Versions: 3.0.1
> Reporter: kondziolka9ld
> Priority: Major
>
> Hi,
> Please consider the following difference in the results of the `randomSplit` method, despite using the same seed.
>
> {code:java}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>       /_/
>
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-----+
> |value|
> +-----+
> | 4|
> +-----+
> scala> s.show
> +-----+
> |value|
> +-----+
> | 1|
> | 2|
> | 3|
> | 5|
> | 6|
> | 7|
> | 8|
> | 9|
> | 10|
> +-----+
> {code}
> while on spark-3:
> {code:java}
> scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-----+
> |value|
> +-----+
> | 5|
> | 10|
> +-----+
> scala> s.show
> +-----+
> |value|
> +-----+
> | 1|
> | 2|
> | 3|
> | 4|
> | 6|
> | 7|
> | 8|
> | 9|
> +-----+
> {code}
> I guess that the implementation of the `sample` method changed.
> Is it possible to restore previous behaviour?
> Thanks in advance!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org