Posted to dev@spark.apache.org by Zack Sampson <zs...@palantir.com> on 2015/09/15 06:12:08 UTC

And.eval short circuiting

It seems like And.eval could avoid calculating right.eval when left.eval returns null. Is there a reason it's written the way it is?


override def eval(input: Row): Any = {
  val l = left.eval(input)
  if (l == false) {
    false // left is FALSE: short circuit, right is never evaluated
  } else {
    // left is TRUE or NULL: the right side is always evaluated
    val r = right.eval(input)
    if (r == false) {
      false // right is FALSE: result is FALSE regardless of left
    } else {
      if (l != null && r != null) {
        true // both sides TRUE
      } else {
        null // at least one side NULL and neither FALSE
      }
    }
  }
}
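
To make the question concrete, this is the short-circuiting variant being asked about, as a standalone sketch (hypothetical code, not proposed Spark source). It diverges from the current code on exactly one input combination:

def evalShortCircuit(l: Any, r: => Any): Any =
  if (l == false) false
  else if (l == null) null // skip evaluating r entirely when the left is null
  else if (r == false) false
  else if (r == null) null
  else true

evalShortCircuit(null, false) // returns null; the rest of the thread shows SQL requires false here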

Re: And.eval short circuiting

Posted by Mingyu Kim <mk...@palantir.com>.
I filed SPARK-10703. Thanks!

Mingyu

Re: And.eval short circuiting

Posted by Reynold Xin <rx...@databricks.com>.
Please file a ticket and cc me. Thanks.


Re: And.eval short circuiting

Posted by Mingyu Kim <mk...@palantir.com>.
That sounds good. I think the optimizer should not change the behavior of execution, and reordering the filters can easily result in errors, as exemplified below. In other words, I agree that, for correctness, the optimizer should not reorder the filters. Please correct me if I have an incorrect assumption about the guarantees of the optimizer.

Is there a bug filed that tracks the change you suggested below, btw? I’d like to follow the issue, if there’s one.

Thanks,
Mingyu

Re: And.eval short circuiting

Posted by Reynold Xin <rx...@databricks.com>.
This is "expected" in the sense that DataFrame operations can get re-ordered under the hood by the optimizer. For example, if the optimizer deems it cheaper to apply the 2nd filter first, it might re-arrange the filters (in reality, it doesn't do that today). I think this is too confusing and violates the principle of least astonishment, so we should fix it.

I discussed this more with Michael offline, and we think we can add a rule for the physical filter operator to replace the general AND/OR/equality/etc with a special version that treats null as false. This rule needs to be written carefully, because it should only apply to subtrees of AND/OR/equality/etc (e.g. it shouldn't rewrite the children of isnull).
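
As a rough sketch of the shape such a rule could take, over a toy expression tree rather than Catalyst's actual classes (every name below is illustrative, not Spark API):

sealed trait Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr
case class IsNull(child: Expr) extends Expr
case class Attr(name: String) extends Expr
// Hypothetical variants that evaluate NULL as FALSE; only safe at the top of a
// physical Filter, where a row is simply kept or dropped.
case class AndNullAsFalse(l: Expr, r: Expr) extends Expr
case class OrNullAsFalse(l: Expr, r: Expr) extends Expr

// Rewrite only the AND/OR spine of a filter predicate. Recursion stops at every
// other node, so children of IsNull (or of any other null-observing expression)
// keep their standard null semantics.
def nullAsFalse(e: Expr): Expr = e match {
  case And(l, r) => AndNullAsFalse(nullAsFalse(l), nullAsFalse(r))
  case Or(l, r)  => OrNullAsFalse(nullAsFalse(l), nullAsFalse(r))
  case other     => other // IsNull, comparisons, UDFs, ... unchanged
}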


RE: And.eval short circuiting

Posted by Zack Sampson <zs...@palantir.com>.
I see. We're having problems with code like this (forgive my noob scala):

val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
df
  .filter($"animals".rlike(".*"))
  .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType, $"animals"))
  .collect()

This code throws an NPE because:
* Catalyst combines the filters with an AND
* the first filter returns null on the row where animals is null
* the second filter tries to read the length of that null

This feels weird. Reading that code, I wouldn't expect null to be passed to the second filter. Even weirder is that if you call collect() after the first filter you won't see nulls, and if you write the data to disk and reread it, the NPE won't happen.

It's bewildering! Is this the intended behavior?
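
In the meantime we can work around it by making the predicate total over nulls, so the AND-combined filter is safe in whatever order its pieces run. A sketch against the same DataFrame (same imports and context as above; the isNotNull filter and the in-UDF null check are the additions):

df
  .filter($"animals".isNotNull) // drop the null row explicitly up front
  .filter($"animals".rlike(".*"))
  .filter(callUDF({ (value: String) =>
    value != null && value.length > 2 // null-guard inside the UDF as well
  }, BooleanType, $"animals"))
  .collect()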

Re: And.eval short circuiting

Posted by Reynold Xin <rx...@databricks.com>.
rxin=# select null and true;
 ?column?
----------

(1 row)

rxin=# select null and false;
 ?column?
----------
 f
(1 row)


null and false should return false, which is why the right side still has to be evaluated even when the left side is null.
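
The same semantics as a minimal, self-contained Scala sketch, with Option[Boolean] standing in for SQL's nullable boolean (illustrative only, not Spark code):

// Three-valued AND: None models SQL NULL.
def and3(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] = (l, r) match {
  case (Some(false), _) | (_, Some(false)) => Some(false) // FALSE wins even against NULL
  case (Some(true), Some(true))            => Some(true)
  case _                                   => None        // every other combination involving NULL is NULL
}

assert(and3(None, Some(true))  == None)        // select null and true  -> NULL (the blank row above)
assert(and3(None, Some(false)) == Some(false)) // select null and false -> f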

