Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2018/08/28 21:50:23 UTC

[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/22259

     [WIP][SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12

    ## What changes were proposed in this pull request?
    
    An alternative take on https://github.com/apache/spark/pull/22063 that does not introduce `udfInternal`.
    It resolves the problem of inferring UDF argument types in Scala 2.12 by instead using information captured when the UDF is registered -- namely, which argument types are nullable (i.e. not primitive).
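
    As a rough standalone sketch of the idea (the `isNullable` helper below is illustrative, not code from this PR): in Scala 2.12 the LMF closure's primitive arguments are translated to `Object`, so reflection on the function can no longer distinguish a primitive from a nullable reference type, but at registration time the argument `TypeTag`s are still available:

    ```
    import scala.reflect.runtime.universe._

    // Illustrative helper: a primitive-backed type is a subtype of AnyVal
    // and can never hold null; everything else can.
    def isNullable[T: TypeTag]: Boolean = !(typeOf[T] <:< typeOf[AnyVal])

    // Registering an (Int, String) => ... UDF would record:
    val nullableTypes = Seq(isNullable[Int], isNullable[String])  // Seq(false, true)
    ```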
    
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-25044.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22259
    
----
commit d99cfa6c8788d782840dae140c15047af924b3a0
Author: Sean Owen <se...@...>
Date:   2018-08-28T14:22:33Z

    Update UDF support to handle nullability correctly in Scala 2.12

----


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by maryannxue <gi...@git.apache.org>.
Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224510115
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Yeah, understood and agree. We'd probably need to change 3 things in follow-up PR(s):
    1. Add `isInstanceOf[KnownNotNull]` back (I can do this).
    2. Move `nullableTypes` up in the parameter list and make it required (see the sketch after this list). I agree that it'd only make things worse if anyone uses this private API and forgets to set this parameter.
    3. Flip the `true`/`false` around for `nullableTypes`. This is rather minor, but I think it helps with readability and consistency.
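
    A rough sketch of what item 2 might look like (a hypothetical signature, assuming Catalyst's `Expression` and `DataType`; not the code that was eventually merged):

    ```
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.types.DataType

    // Hypothetical: with nullableTypes placed next to children and given no
    // default value, a caller cannot construct a ScalaUDF and silently omit it.
    case class ScalaUDF(
        function: AnyRef,
        dataType: DataType,
        children: Seq[Expression],
        nullableTypes: Seq[Boolean],        // required: moved up, no `= Nil`
        inputTypes: Seq[DataType] = Nil,
        udfName: Option[String] = None,
        nullable: Boolean = true,
        udfDeterministic: Boolean = true)
    ```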


---



[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22259


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224263116
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    I see; you are saying that some UDF needed to declare nullable types but didn't? I made the parameter optional to try to make 'migration' easier and to avoid changing the signature much. But doesn't the test you point to pass? Are you saying it should not pass?


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224542452
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    I think it's mostly about maintainability. We should definitely update the code as @maryannxue said, so that `ScalaUDF` is easier to use and not that error-prone. I feel we don't need to backport it, as it's basically code refactor.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    **[Test build #4297 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4297/testReport)** for PR 22259 at commit [`d99cfa6`](https://github.com/apache/spark/commit/d99cfa6c8788d782840dae140c15047af924b3a0).


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by maryannxue <gi...@git.apache.org>.
Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224295469
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Yes, the test should not pass after removing the `isInstanceOf[KnownNotNull]` condition from the `needsNullCheck` test (https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2L2160). The idea was to add a `KnownNotNull` node on top of the original node to mark it as null-checked, so that the rule won't add redundant null checks even if it is accidentally applied again. I'm not sure of the exact reason why you removed the `isInstanceOf[KnownNotNull]` condition in this PR, but I think it should be left there alongside the new nullable-type check.
    After adding the `nullableTypes` parameter in the test, the issue can be reproduced:
    ```
      test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
        val a = testRelation.output(0)
        val func = (x: Int, y: Int) => x + y
        val udf1 = ScalaUDF(func, IntegerType, a :: a :: Nil, nullableTypes = false :: false :: Nil)
        val udf2 = ScalaUDF(func, IntegerType, a :: udf1 :: Nil, nullableTypes = false :: false :: Nil)
        val plan = Project(Alias(udf2, "")() :: Nil, testRelation)
        comparePlans(plan.analyze, plan.analyze.analyze)
      }
    ```
    BTW, I'm just curious: it looks like `nullableTypes` indicates the opposite of the "nullable" used in a schema. I would assume that when `nullableTypes` is `Seq(false)`, it means the type is not nullable and we need not add the null check, and vice versa. Did I miss something here?
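
    To make the `KnownNotNull` idea above concrete, here is a self-contained toy sketch (toy `Expr` types standing in for Catalyst's; this is not the real rule):

    ```
    sealed trait Expr
    case class Column(name: String) extends Expr
    case class KnownNotNull(child: Expr) extends Expr

    // A slot needs a guard only if its type is primitive (not nullable) and
    // it has not already been marked as checked.
    def needsNullCheck(nullable: Boolean, e: Expr): Boolean =
      !nullable && !e.isInstanceOf[KnownNotNull]

    // After the If guard is added, wrap the checked inputs; a second run of
    // the rule then finds nothing left to check, so it stays idempotent.
    def markChecked(nullableTypes: Seq[Boolean], inputs: Seq[Expr]): Seq[Expr] =
      nullableTypes.zip(inputs).map {
        case (n, e) if needsNullCheck(n, e) => KnownNotNull(e)
        case (_, e)                         => e
      }

    // e.g. markChecked(Seq(false, true), Seq(Column("a"), Column("b")))
    //      == Seq(KnownNotNull(Column("a")), Column("b"))
    ```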


---



[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r213570418
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -2149,28 +2149,34 @@ class Analyzer(
     
           case p => p transformExpressionsUp {
     
    -        case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
    -          val parameterTypes = ScalaReflection.getParameterTypes(func)
    -          assert(parameterTypes.length == inputs.length)
    -
    -          // TODO: skip null handling for not-nullable primitive inputs after we can completely
    -          // trust the `nullable` information.
    -          // (cls, expr) => cls.isPrimitive && expr.nullable
    -          val needsNullCheck = (cls: Class[_], expr: Expression) =>
    -            cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
    -          val inputsNullCheck = parameterTypes.zip(inputs)
    -            .filter { case (cls, expr) => needsNullCheck(cls, expr) }
    -            .map { case (_, expr) => IsNull(expr) }
    -            .reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
    -          // Once we add an `If` check above the udf, it is safe to mark those checked inputs
    -          // as not nullable (i.e., wrap them with `KnownNotNull`), because the null-returning
    -          // branch of `If` will be called if any of these checked inputs is null. Thus we can
    -          // prevent this rule from being applied repeatedly.
    -          val newInputs = parameterTypes.zip(inputs).map{ case (cls, expr) =>
    -            if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
    -          inputsNullCheck
    -            .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
    -            .getOrElse(udf)
    +        case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
    --- End diff --
    
    ah missed it. We can clean it up later.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by maryannxue <gi...@git.apache.org>.
Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224252642
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    I think the problem is more about how we handle `nullableTypes` when it is not specified, as in https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2157. The test failure of https://github.com/apache/spark/pull/21851/files#diff-e8dddba2915a147970654aa93bee30a7R344 would have been exposed if `nullableTypes` had been updated in this PR. So I would say that logically this parameter is required, but right now it is declared optional. In this case, things went wrong when `nullableTypes` was left unspecified, and this could happen not only in tests but in "source" too. I suggest we move this parameter up, right after `inputTypes`, so it gets the attention it needs.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224540341
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Yeah, I could see an argument that it need not block the release. The functionality works as intended, at least.
    
    Would you change it again in 2.4.1? If not, then we've decided to just keep this behavior. Let's at least get this in if there is a new RC.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r214585176
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Using `Nil` as the default value is dangerous. We do not even have an assert to ensure it is set. We could easily miss setting it to the right values.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r214683934
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    The logic here, again, was that we wanted to avoid changing the binary signature. I know Catalyst is effectively private to Spark, but this class wasn't specifically marked private; I wondered whether changing it would actually affect callers. If not, we can go back and merge it.
    
    `Nil` is just an empty list; I don't think it's dangerous, and it is already the default for `inputTypes` above. It is not always set, because it's not always possible to infer the schema, let alone nullability.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    thanks, merging to master!


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r214603299
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    We put the assert in the rule `HandleNullInputsForUDF`.
    
    Can we merge `inputTypes` and `nullableTypes` here so that we don't need to worry about it any more? cc @srowen
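
    One conceivable shape for such a merge, as a sketch only (using Catalyst's public type objects; not necessarily what a follow-up would adopt):

    ```
    import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

    // Sketch: carry each input's type and its nullability together, so one
    // cannot be supplied without the other.
    case class InputType(dataType: DataType, nullable: Boolean)

    val merged: Seq[InputType] =
      Seq(InputType(IntegerType, nullable = false),
          InputType(StringType, nullable = true))
    ```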


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95387/
    Test FAILed.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    See also https://github.com/apache/spark/pull/22259


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224534401
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    In theory, nothing, because `ScalaUDF` is conceptually private (right?). In practice, people might use it. The argument here is to force users who use `ScalaUDF` directly to confront the change we had to make for Scala 2.12, before we release that change in 2.4.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224597357
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    @maryannxue Please submit a PR to make the parameter `nullableTypes` required.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    **[Test build #95387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95387/testReport)** for PR 22259 at commit [`d99cfa6`](https://github.com/apache/spark/commit/d99cfa6c8788d782840dae140c15047af924b3a0).


---



[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r213569142
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -2149,28 +2149,34 @@ class Analyzer(
     
           case p => p transformExpressionsUp {
     
    -        case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
    -          val parameterTypes = ScalaReflection.getParameterTypes(func)
    -          assert(parameterTypes.length == inputs.length)
    -
    -          // TODO: skip null handling for not-nullable primitive inputs after we can completely
    -          // trust the `nullable` information.
    -          // (cls, expr) => cls.isPrimitive && expr.nullable
    -          val needsNullCheck = (cls: Class[_], expr: Expression) =>
    -            cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
    -          val inputsNullCheck = parameterTypes.zip(inputs)
    -            .filter { case (cls, expr) => needsNullCheck(cls, expr) }
    -            .map { case (_, expr) => IsNull(expr) }
    -            .reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
    -          // Once we add an `If` check above the udf, it is safe to mark those checked inputs
    -          // as not nullable (i.e., wrap them with `KnownNotNull`), because the null-returning
    -          // branch of `If` will be called if any of these checked inputs is null. Thus we can
    -          // prevent this rule from being applied repeatedly.
    -          val newInputs = parameterTypes.zip(inputs).map{ case (cls, expr) =>
    -            if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
    -          inputsNullCheck
    -            .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
    -            .getOrElse(udf)
    +        case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
    --- End diff --
    
    nit: can we restore the spaces as in the original?


---



[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r213689172
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -2149,28 +2149,34 @@ class Analyzer(
     
           case p => p transformExpressionsUp {
     
    -        case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
    -          val parameterTypes = ScalaReflection.getParameterTypes(func)
    -          assert(parameterTypes.length == inputs.length)
    -
    -          // TODO: skip null handling for not-nullable primitive inputs after we can completely
    -          // trust the `nullable` information.
    -          // (cls, expr) => cls.isPrimitive && expr.nullable
    -          val needsNullCheck = (cls: Class[_], expr: Expression) =>
    -            cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
    -          val inputsNullCheck = parameterTypes.zip(inputs)
    -            .filter { case (cls, expr) => needsNullCheck(cls, expr) }
    -            .map { case (_, expr) => IsNull(expr) }
    -            .reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
    -          // Once we add an `If` check above the udf, it is safe to mark those checked inputs
    -          // as not nullable (i.e., wrap them with `KnownNotNull`), because the null-returning
    -          // branch of `If` will be called if any of these checked inputs is null. Thus we can
    -          // prevent this rule from being applied repeatedly.
    -          val newInputs = parameterTypes.zip(inputs).map{ case (cls, expr) =>
    -            if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
    -          inputsNullCheck
    -            .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
    -            .getOrElse(udf)
    +        case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
    --- End diff --
    
    I might be missing it: what is the space issue? There's an additional level of indentation because of the if statement.


---



[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r213699237
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -2149,28 +2149,34 @@ class Analyzer(
     
           case p => p transformExpressionsUp {
     
    -        case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
    -          val parameterTypes = ScalaReflection.getParameterTypes(func)
    -          assert(parameterTypes.length == inputs.length)
    -
    -          // TODO: skip null handling for not-nullable primitive inputs after we can completely
    -          // trust the `nullable` information.
    -          // (cls, expr) => cls.isPrimitive && expr.nullable
    -          val needsNullCheck = (cls: Class[_], expr: Expression) =>
    -            cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
    -          val inputsNullCheck = parameterTypes.zip(inputs)
    -            .filter { case (cls, expr) => needsNullCheck(cls, expr) }
    -            .map { case (_, expr) => IsNull(expr) }
    -            .reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
    -          // Once we add an `If` check above the udf, it is safe to mark those checked inputs
    -          // as not nullable (i.e., wrap them with `KnownNotNull`), because the null-returning
    -          // branch of `If` will be called if any of these checked inputs is null. Thus we can
    -          // prevent this rule from being applied repeatedly.
    -          val newInputs = parameterTypes.zip(inputs).map{ case (cls, expr) =>
    -            if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
    -          inputsNullCheck
    -            .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
    -            .getOrElse(udf)
    +        case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
    --- End diff --
    
    It is in `udf@ScalaUDF`, which should have been `udf @ ScalaUDF`.


---





[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    **[Test build #95387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95387/testReport)** for PR 22259 at commit [`d99cfa6`](https://github.com/apache/spark/commit/d99cfa6c8788d782840dae140c15047af924b3a0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224506066
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    I think the logic was that `nullableTypes` carries all the information about which inputs are allowed to be null. Really, it's true for reference types, which can be assigned null, and false for things like primitives, which can never take the value null. It might be described as "doesntNeedNullCheck"; that could be flipped to "needsNullCheck", although we'd have to go back and flip all the callers.
    
    We can add back the check for `isInstanceOf[KnownNotNull]`; you know this area better than I do.
    
    But that doesn't quite solve the issue that anyone calling `ScalaUDF` might forget to specify which types need the null check, now that Scala 2.12 can't infer it. We can make it required by moving the args around, but I worried that would break callers that use this 'private' API. Then again, is it worse to silently keep accepting the `ScalaUDF` when its interpretation could become wrong in Scala 2.12?
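
    A toy illustration of the naming point (hypothetical values for a `(String, Int) => ...` function; not code from the PR):

    ```
    // Current convention: true = reference type (may hold null, no guard
    // needed); false = primitive (null would be coerced, so guard it).
    val nullableTypes  = Seq(true, false)
    // The suggested flip reads more directly at the use site:
    val needsNullCheck = nullableTypes.map(!_)   // Seq(false, true)
    ```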


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224512333
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    I'm OK with that, but obviously we need to get this into 2.4 if it's going to change. There's an RC out that has almost passed, and I flagged it. @cloud-fan it's worth considering a new RC for a change like this, even though I'm always a little uneasy about late significant changes. That said, we need to release it the way we want it. (And there are a few other minor doc things we could get into 2.4.)


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224536625
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Then why does this need to be a blocker? Private APIs change very often.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224532350
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    What's the impact on end users?


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2648/
    Test PASSed.


---



[GitHub] spark pull request #22259: [WIP][SPARK-25044][SQL] (take 2) Address translat...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r213701008
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -2149,28 +2149,34 @@ class Analyzer(
     
           case p => p transformExpressionsUp {
     
    -        case udf @ ScalaUDF(func, _, inputs, _, _, _, _) =>
    -          val parameterTypes = ScalaReflection.getParameterTypes(func)
    -          assert(parameterTypes.length == inputs.length)
    -
    -          // TODO: skip null handling for not-nullable primitive inputs after we can completely
    -          // trust the `nullable` information.
    -          // (cls, expr) => cls.isPrimitive && expr.nullable
    -          val needsNullCheck = (cls: Class[_], expr: Expression) =>
    -            cls.isPrimitive && !expr.isInstanceOf[KnownNotNull]
    -          val inputsNullCheck = parameterTypes.zip(inputs)
    -            .filter { case (cls, expr) => needsNullCheck(cls, expr) }
    -            .map { case (_, expr) => IsNull(expr) }
    -            .reduceLeftOption[Expression]((e1, e2) => Or(e1, e2))
    -          // Once we add an `If` check above the udf, it is safe to mark those checked inputs
    -          // as not nullable (i.e., wrap them with `KnownNotNull`), because the null-returning
    -          // branch of `If` will be called if any of these checked inputs is null. Thus we can
    -          // prevent this rule from being applied repeatedly.
    -          val newInputs = parameterTypes.zip(inputs).map{ case (cls, expr) =>
    -            if (needsNullCheck(cls, expr)) KnownNotNull(expr) else expr }
    -          inputsNullCheck
    -            .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
    -            .getOrElse(udf)
    +        case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
    --- End diff --
    
    Oh, right. Yeah, I didn't mean to change that. It's minor enough to leave, I think (or else we should standardize it across the codebase).


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224607076
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Yeah, it looks like we should just make these changes after all, and for 2.4, since we need another RC anyway.


---



[GitHub] spark pull request #22259: [SPARK-25044][SQL] (take 2) Address translation o...

Posted by maryannxue <gi...@git.apache.org>.
Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22259#discussion_r224527615
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
    @@ -47,7 +48,8 @@ case class ScalaUDF(
         inputTypes: Seq[DataType] = Nil,
         udfName: Option[String] = None,
         nullable: Boolean = true,
    -    udfDeterministic: Boolean = true)
    +    udfDeterministic: Boolean = true,
    +    nullableTypes: Seq[Boolean] = Nil)
    --- End diff --
    
    Yes, I think this should be in 2.4 too, since it's an API change.
    BTW, I just finished https://github.com/apache/spark/pull/22701, addressing suggested change 2 above. Changes 1 and 3 can then be covered in another PR.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    **[Test build #4297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4297/testReport)** for PR 22259 at commit [`d99cfa6`](https://github.com/apache/spark/commit/d99cfa6c8788d782840dae140c15047af924b3a0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22259: [WIP][SPARK-25044][SQL] (take 2) Address translation of ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22259
  
    LGTM. How did you work around the type tag not found issue?


---
