Posted to issues@spark.apache.org by "Alberto Fernández (JIRA)" <ji...@apache.org> on 2017/06/07 11:18:18 UTC

[jira] [Commented] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

    [ https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040716#comment-16040716 ] 

Alberto Fernández commented on SPARK-17237:
-------------------------------------------

Hi there,

I think this change introduced a regression in the way the "withColumnRenamed" method works. I can reproduce it with the following example:

{code}
dataframe = sql("SELECT * FROM db.table")
another_dataframe = sql("SELECT * FROM db.another_table")

(dataframe
    .join(another_dataframe, on=[...])
    .groupBy(...)  # pivot requires a preceding groupBy
    .pivot("column_name", values=[0, 1])
    .max("column1", "column2")
    .withColumnRenamed("0_max(another_table.`column1`)", "name1")
    .withColumnRenamed("0_max(another_table.`column2`)", "name2"))
{code}

With Spark 2.1.0, the behaviour is as expected (buggy, but expected): the columns don't get renamed.

With Spark 2.1.1, if this issue had really been resolved, you wouldn't need to change anything for the renames to work. However, the columns still don't get renamed, because under 2.1.1 you need to use the following names instead:

{code}
dataframe = sql("SELECT * FROM db.table")
another_dataframe = sql("SELECT * FROM db.another_table")

(dataframe
    .join(another_dataframe, on=[...])
    .groupBy(...)  # pivot requires a preceding groupBy
    .pivot("column_name", values=[0, 1])
    .max("column1", "column2")
    .withColumnRenamed("0_max(column1)", "name1")
    .withColumnRenamed("0_max(column2)", "name2"))
{code}

As you can see, it seems that this PR somehow managed to remove the table name from the join context and also removed the backticks, thus introducing a breaking change.

I should also note that the original issue didn't happen when using JSON as the output format. It only surfaces because Parquet doesn't support the ( and ) characters in column names, whereas in JSON they work just fine. Here is an example of the error thrown by Parquet after upgrading to Spark 2.1.1 without modifying your code:

{code}
Attribute name "0_max(column1)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
{code}
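
For anyone blocked by that error, here is a minimal sketch of the aliasing the message suggests, written against the PySpark API. The "pivoted" variable and the output path are placeholders for illustration, not from the original report:

{code}
import re

# Replace every character Parquet rejects with an underscore before
# writing; "pivoted" and the path below are illustrative only.
def sanitize_columns(df):
    for name in df.columns:
        clean = re.sub(r"[ ,;{}()\n\t=]", "_", name)
        if clean != name:
            df = df.withColumnRenamed(name, clean)
    return df

sanitize_columns(pivoted).write.parquet("/path/to/output")
{code}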

I think the original issue was that parseAttributeName cannot handle the "table.column" notation, and as I understand it, this PR still doesn't fix that, right?
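
For context, this is my understanding of how the attribute-name parsing treats dots and backticks (a sketch of the behaviour, not verified against the parser source):

{code}
# An unquoted dot is treated as a nesting separator, so a generated
# name such as another_table.`column1` confuses the parser, while
# backticks around the whole string make it one literal identifier:
df.select("a.b")      # resolved as field "b" inside a column "a"
df.select("`a.b`")    # resolved as a single column literally named "a.b"
{code}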

As a workaround, you can change your column renames to accommodate the new format.
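
To get the rename strings right on 2.1.1, listing the generated names directly is probably safer than guessing the format. A quick check, assuming "pivoted" holds the pivoted DataFrame built in the examples above:

{code}
# Print the exact column names Spark generated, then copy the rename
# strings from this output instead of guessing the new format.
print(pivoted.columns)
# expected on 2.1.1: names like '0_max(column1)', '1_max(column2)', ...
{code}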

Any ideas? Am I missing something?

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17237
>                 URL: https://issues.apache.org/jira/browse/SPARK-17237
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jiang Qiqi
>            Assignee: Takeshi Yamamuro
>              Labels: newbie
>             Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation that I previously ran on a Spark 1.6 cluster, namely:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
> After upgrading the environment to Spark 2.0, I got an error while executing the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`c`)`;
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)


