Posted to issues@spark.apache.org by "David Allsopp (JIRA)" <ji...@apache.org> on 2017/10/17 16:57:00 UTC

[jira] [Updated] (SPARK-21459) Some aggregation functions change the case of nested field names

     [ https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Allsopp updated SPARK-21459:
----------------------------------
    Description: 
When working with DataFrames with nested schemas, the behavior of the aggregation functions is inconsistent with respect to preserving the case of the nested field names.

For example, {{first()}} preserves the case of the field names, but {{collect_set()}} and {{collect_list()}} force the field names to lowercase.

Expected behavior: Field name case is preserved (or is at least consistent and documented)

Spark-shell session to reproduce:

*Update*: After testing different versions, I found that this problem occurs in the Spark 1.6.0 build shipped with Cloudera CDH, not in plain Apache Spark 1.6.0.
Plain Spark 1.6.0 does not support structs in aggregation operations such as {{collect_set}} at all.

{code:scala}
case class Inner(Key:String, Value:String)
case class Outer(ID:Long, Pairs:Array[Inner])

val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
val df = sqlContext.createDataFrame(rdd)

scala> df
... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]

scala> df.groupBy("ID").agg(first("Pairs"))
... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
// Note that Key and Value preserve their original case

scala> df.groupBy("ID").agg(collect_set("Pairs"))
... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>]
// Note that key and value are now lowercased

{code}

Additionally, the column name generated during aggregation is inconsistent: {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses in the name produced by {{first}}.
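
A possible workaround (a sketch only - not verified against the CDH build, and struct-to-struct cast support may vary by Spark version) is to cast the aggregated column back to an explicit nested type, since a cast to a struct type applies the target type's field names. The {{result}} value below is illustrative and reuses {{df}} from the session above:

{code:scala}
// Sketch of a possible workaround (untested against the CDH build): cast the
// aggregated column back to a nested type built from the original schema, so
// the struct field names ("Key", "Value") come from the target type rather
// than from whatever collect_set produced.
import org.apache.spark.sql.functions.{col, collect_set}
import org.apache.spark.sql.types.ArrayType

// df is the DataFrame from the session above; its "Pairs" column has type
// array<struct<Key:string,Value:string>>.
val pairsType = df.schema("Pairs").dataType

val result = df.groupBy("ID")
  .agg(collect_set("Pairs").as("Pairs"))
  // collect_set over an array column yields array<array<struct<...>>>,
  // hence the extra ArrayType wrapper around the original column type.
  .withColumn("Pairs", col("Pairs").cast(ArrayType(pairsType)))
{code}

If the cast is rejected on the affected version, mapping over the rows and rebuilding the structs explicitly would be the fallback.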


> Some aggregation functions change the case of nested field names
> ----------------------------------------------------------------
>
>                 Key: SPARK-21459
>                 URL: https://issues.apache.org/jira/browse/SPARK-21459
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: David Allsopp
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org