You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/03/02 01:21:00 UTC
[jira] [Resolved] (SPARK-29419) Seq.toDS / spark.createDataset(Seq)
is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-29419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-29419.
----------------------------------
Fix Version/s: 2.4.6
3.0.0
Resolution: Fixed
Fixed in https://github.com/apache/spark/pull/26076
> Seq.toDS / spark.createDataset(Seq) is not thread-safe
> ------------------------------------------------------
>
> Key: SPARK-29419
> URL: https://issues.apache.org/jira/browse/SPARK-29419
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.3, 2.3.4, 2.4.0, 3.0.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Blocker
> Labels: correctness
> Fix For: 3.0.0, 2.4.6
>
>
> The {{Seq.toDS}} and {{spark.createDataset(Seq)}} code is not thread-safe: if the caller-supplied {{Encoder}} is used in multiple threads then {{createDataset}}'s usage of the encoder may lead to incorrect answers because the Encoder's internal mutable state will be updated by from multiple threads.
> Here is an example demonstrating the problem:
> {code:java}
> import org.apache.spark.sql._
> val enc = implicitly[Encoder[(Int, Int)]]
> val datasets = (1 to 100).par.map { _ =>
> val pairs = (1 to 100).map(x => (x, x))
> spark.createDataset(pairs)(enc)
> }
> datasets.reduce(_ union _).collect().foreach {
> pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
> }{code}
> Due to the thread-safety issue, the above example results in the creation of corrupted records where different input records' fields are co-mingled.
> This bug is similar to SPARK-22355, a related problem in {{Dataset.collect()}} (fixed in Spark 2.2.1+).
> Fortunately, this has a simple one-line fix (copy the encoder); I'll submit a patch for this shortly.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org