Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2017/06/15 19:03:00 UTC

[jira] [Created] (SPARK-21110) Structs should be orderable

Nicholas Chammas created SPARK-21110:
----------------------------------------

             Summary: Structs should be orderable
                 Key: SPARK-21110
                 URL: https://issues.apache.org/jira/browse/SPARK-21110
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.1
            Reporter: Nicholas Chammas
            Priority: Minor


It seems like a missing feature that structs can't be compared in a filter on a DataFrame.

Here's a simple demonstration of a) where this would be useful and b) how it differs from simply comparing each of the structs' components separately.

{code}
import pyspark
from pyspark.sql.functions import col, struct, concat

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('Boston', 'Bob'),
        ('Boston', 'Nick'),
        ('San Francisco', 'Bob'),
        ('San Francisco', 'Nick'),
    ],
    ['city', 'person']
)
pairs = (
    df.select(
        struct('city', 'person').alias('p1')
    )
    .crossJoin(
        df.select(
            struct('city', 'person').alias('p2')
        )
    )
)

print("Everything")
pairs.show()

print("Comparing parts separately (doesn't give me what I want)")
(pairs
    .where(col('p1.city') < col('p2.city'))
    .where(col('p1.person') < col('p2.person'))
    .show())

print("Comparing parts together with concat (gives me what I want but is hacky)")
(pairs
    .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
    .show())

print("Comparing parts together with struct (my desired solution but currently yields an error)")
(pairs
    .where(col('p1') < col('p2'))
    .show())
{code}
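For reference, the ordering I'm after is the lexicographic ordering Python already gives tuples. A minimal plain-Python sketch (illustrating the desired semantics, not Spark code) of why the component-wise and concat approaches fall short:

```python
# Lexicographic comparison, as Python tuples already implement it.
p1 = ('Boston', 'Nick')
p2 = ('San Francisco', 'Bob')

# Desired semantics: compare city first, then person as a tie-breaker.
assert p1 < p2  # 'Boston' < 'San Francisco', so the pair compares less

# Component-wise comparison disagrees: it also demands 'Nick' < 'Bob'.
assert not (p1[0] < p2[0] and p1[1] < p2[1])

# Concatenation is hacky because field boundaries are lost, which can
# invert the ordering of some pairs.
a = ('a', 'z')
b = ('ab', 'a')
assert a < b                       # lexicographic: 'a' < 'ab'
assert a[0] + a[1] > b[0] + b[1]   # concat: 'az' > 'aba' -- inverted!
```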

The last query yields the following error in Spark 2.1.1:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not struct<city:string,person:string>;;
'Filter (p1#5 < p2#8)
+- Join Cross
   :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
   :  +- LogicalRDD [city#0, person#1]
   +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
      +- LogicalRDD [city#0, person#1]
{code}
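Until structs are orderable, one interim workaround (a suggestion of mine, not part of the original report) is to spell out the lexicographic comparison with Column operators, e.g. {{(col('p1.city') < col('p2.city')) | ((col('p1.city') == col('p2.city')) & (col('p1.person') < col('p2.person')))}}. A plain-Python sketch of that same two-field predicate, checked against tuple ordering:

```python
def lex_lt(p1, p2):
    """Two-field lexicographic 'less than', the predicate the Column
    expression above spells out: the first field decides, the second
    breaks ties."""
    city1, person1 = p1
    city2, person2 = p2
    return city1 < city2 or (city1 == city2 and person1 < person2)

pairs = [('Boston', 'Bob'), ('Boston', 'Nick'),
         ('San Francisco', 'Bob'), ('San Francisco', 'Nick')]

# The hand-rolled predicate agrees with Python's tuple ordering everywhere.
assert all(lex_lt(a, b) == (a < b) for a in pairs for b in pairs)
```

This grows quadratically in verbosity with the number of struct fields, which is exactly why native struct ordering would be the cleaner fix.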




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org