You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/09 16:43:47 UTC
[GitHub] [spark] fqaiser94 commented on pull request #27066: [SPARK-31317][SQL] Add withField method to Column

fqaiser94 commented on pull request #27066:
URL: https://github.com/apache/spark/pull/27066#issuecomment-640747267


   @cloud-fan 
   
   Thanks for your PR. I've merged your changes into this branch with some modifications in order to pass the tests. 
   
   > Remove the SQL API. It's unclear to me if we should add a special syntax for it, or simply treat it as a function. We can add the SQL API later.
   
   I tried to implement special syntax for it before but it involved a lot of changes and I didn't think the added complexity was worth it over simply having a function. 
   Anyway, I think adding the SQL API later is the correct decision if we're unsure about it right now and it helps keep this PR small.
   
   > return null if the struct is null. I think the user-facing behavior matters more than the internal implementation. I don't think it's a big problem to have deeply nested if-else expressions. It's similar to the encoder expressions.
   
   While the code will now give this behaviour (`return null if struct is null`), I'm still not entirely convinced the way this is currently implemented is a good idea.
   Looking at the generated physical plan, it looks like Spark would be doing a lot of unnecessary work for **nullable** struct columns. 
   In the following example, I add 2 columns to a nullable struct. 
   
   ```
   nullStructLevel1.withColumn("a", 'a.withField("b", lit(2)).withField("d", lit(4))).explain()
   // == Physical Plan ==
   // *(1) Project [if (isnull(if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4) AS a#8]
   // +- *(1) Scan ExistingRDD[a#6]
   ```
   As you can see, the `If(IsNull(_), Literal(null, _), CreateNamedStruct(_))` check has to happen several times in this simple example unnecessarily. 
   This quickly gets out of hand the more fields you add. 
   For example, if I add 3 columns to the same nullable struct: 
   ```
   nullStructLevel1.withColumn("a", 'a.withField("b", lit(2)).withField("d", lit(4)).withField("e", lit(5))).explain()
   // == Physical Plan ==
   // *(1) Project [if (isnull(if (isnull(if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4))) null else named_struct(a, if (isnull(if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4).a, b, if (isnull(if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4).b, c, if (isnull(if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4).c, d, if (isnull(if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4).d, e, 5) AS a#10]
   // +- *(1) Scan ExistingRDD[a#6]
   
   ```
   Do you still think this is not an issue? 
   
   If we're adamant on having this `return null if the struct is null` behaviour, I'll try to think of ways to fix this (whilst still reusing `CreateNamedStruct` Expression). 
   
   There might be a possibility of writing optimizer rules that can simplify nested if-else expressions of this sort.  One simple rule I can think of is as follows: 
   ```
   object RewriteIsNullIfIsNull extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
       case IsNull(If(IsNull(a), Literal(null, _), b)) if !b.nullable => IsNull(a)
     }
   }
   ```
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org