Posted to issues@spark.apache.org by "Huon Wilson (JIRA)" <ji...@apache.org> on 2018/12/19 05:40:00 UTC
[jira] [Created] (SPARK-26403) DataFrame pivot using array column fails with "Unsupported literal type class"
Huon Wilson created SPARK-26403:
-----------------------------------
Summary: DataFrame pivot using array column fails with "Unsupported literal type class"
Key: SPARK-26403
URL: https://issues.apache.org/jira/browse/SPARK-26403
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: Huon Wilson
Doing a pivot (using the {{pivot(pivotColumn: Column)}} overload) on a column containing arrays results in a runtime error:
{code:none}
scala> val df = Seq((1, Seq("a", "x"), 2), (1, Seq("b"), 3), (2, Seq("a", "x"), 10), (3, Seq(), 100)).toDF("x", "s", "y")
df: org.apache.spark.sql.DataFrame = [x: int, s: array<string> ... 1 more field]
scala> df.show
+---+------+---+
| x| s| y|
+---+------+---+
| 1|[a, x]| 2|
| 1| [b]| 3|
| 2|[a, x]| 10|
| 3| []|100|
+---+------+---+
scala> df.groupBy("x").pivot("s").agg(collect_list($"y")).show
java.lang.RuntimeException: Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef WrappedArray()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:419)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:397)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:317)
... 49 elided
{code}
However, this doesn't seem to be a fundamental limitation of {{pivot}}: the {{pivot(pivotColumn: Column, values: Seq[Any])}} overload works fine, as long as the array values are passed as the {{Array}} type rather than {{Seq}}:
{code:none}
scala> val rawValues = df.select("s").distinct.sort("s").collect
rawValues: Array[org.apache.spark.sql.Row] = Array([WrappedArray()], [WrappedArray(a, x)], [WrappedArray(b)])
scala> val values = rawValues.map(_.getSeq[String](0).to[Array])
values: Array[Array[String]] = Array(Array(), Array(a, x), Array(b))
scala> df.groupBy("x").pivot("s", values).agg(collect_list($"y")).show
+---+-----+------+---+
| x| []|[a, x]|[b]|
+---+-----+------+---+
| 1| []| [2]|[3]|
| 3|[100]| []| []|
| 2| []| [10]| []|
+---+-----+------+---+
{code}
It would be nice if {{pivot}} were resilient to Spark's own representation of array columns, so that the first version worked.
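In the meantime, the workaround above can be wrapped in a small helper. This is only a sketch: the name {{pivotOnStringArrayColumn}} is hypothetical, it assumes the pivot column is {{array<string>}}, and it triggers an extra job to collect the distinct values (just as the manual workaround does):

{code:none}
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset}

// Hypothetical helper: pivot on an array<string> column by pre-collecting
// the distinct values and converting each Seq to an Array, which
// Literal.apply does handle (unlike WrappedArray).
def pivotOnStringArrayColumn(
    df: DataFrame,
    groupCol: String,
    pivotCol: String): RelationalGroupedDataset = {
  val values = df.select(pivotCol).distinct.sort(pivotCol).collect
    .map(_.getSeq[String](0).to[Array])
  df.groupBy(groupCol).pivot(pivotCol, values)
}

// e.g. pivotOnStringArrayColumn(df, "x", "s").agg(collect_list($"y")).show
{code}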
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)