You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/23 06:53:52 UTC

[GitHub] [arrow-datafusion] yjshen opened a new issue #2064: DivideByZero for CaseWhen

yjshen opened a new issue #2064:
URL: https://github.com/apache/arrow-datafusion/issues/2064


   `CASE WHEN a > 0 THEN 25.0 / cast(a, float64) ELSE float64(null) END` will fail with error: `ArrowError(DivideByZero)`, since we are computing `25 div a` without checking the `when` condition.
   
   
   To reproduce
   ```rust
   #[test]
       fn case_without_expr1() -> Result<()> {
           let batch = case_test_batch1()?;
           let schema = batch.schema();
   
           // CASE WHEN a > 0 THEN 25.0 / cast(a, float64) ELSE float64(null) END
           let when1 = binary(
               col("a", &schema)?,
               Operator::Gt,
               lit(ScalarValue::Int32(Some(0))),
               &batch.schema(),
           )?;
           let then1 = binary(
               lit(ScalarValue::Float64(Some(25.0))),
               Operator::Divide,
               cast(col("a", &schema)?, &batch.schema(), Float64)?,
               &batch.schema(),
           )?;
           let x = lit(ScalarValue::Float64(None));
   
           let expr = case(None, &[(when1, then1)], Some(x))?;
           let result = expr.evaluate(&batch)?.into_array(batch.num_rows());
           let result = result
               .as_any()
               .downcast_ref::<Float64Array>()
               .expect("failed to downcast to Int32Array");
   
           let expected = &Float64Array::from(vec![Some(25.0), None, None, Some(5.0)]);
   
           assert_eq!(expected, result);
   
           Ok(())
       }
   
       fn case_test_batch1() -> Result<RecordBatch> {
           let schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);
           let a = Int32Array::from(vec![Some(1), Some(0), None, Some(5)]);
           let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
           Ok(batch)
       }
   ```
   
   The [cause](https://github.com/apache/arrow-datafusion/blob/master/datafusion-physical-expr/src/expressions/case.rs#L344) of the problem:
   
   ```rust
   for i in (0..self.when_then_expr.len()).rev() {
               let i = i as usize;
   
               let when_value = self.when_then_expr[i].0.evaluate(batch)?;
               let when_value = when_value.into_array(batch.num_rows());
   
               let then_value = self.when_then_expr[i].1.evaluate(batch)?;     // <------------
               let then_value = then_value.into_array(batch.num_rows());
   ```
   
   We are not checking the "selection vector" generated by the `when clause` to evaluate `then clause`.
   
   A possible solution with minimum changes from arrow kernel computations might be:
   ```rust
   pub trait PhysicalExpr: Send + Sync + Display + Debug {
   ......
   .....
       /// Evaluate an expression against a RecordBatch with validity array
       fn evaluate_selection(&self, batch: &RecordBatch, selection: BooleanArray) -> Result<ColumnarValue> {
           let valid_batch = take_all_valid(batch, &selection);
           let tmp = self.evaluate(&valid_batch)?;
           Ok(scatter(tmp, &selection))
       }
   }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on issue #2064: DivideByZero for CaseWhen

Posted by GitBox <gi...@apache.org>.
yjshen commented on issue #2064:
URL: https://github.com/apache/arrow-datafusion/issues/2064#issuecomment-1075987181


   cc @alamb @houqp for thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #2064: DivideByZero for CaseWhen

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2064:
URL: https://github.com/apache/arrow-datafusion/issues/2064#issuecomment-1076149642


   @yjshen  I think this is a common problem with vectorized evaluation engines, where the overhead of checking for overflow somewhat reduces the effectiveness of the vectorization
   
   I think your suggestion of using `take_all_valid` is a reasonable one
   
   The other approach I have heard of (though never seen implemented) is to implement some sort of 'poisoning' / deferred exception tracking that leaves the value of `a / 0` as `nulll` but doesn't throw the error until later
   
   My personal opinion is go for correct (aka the `take` approach you outline) and then we can optimize for speed if/when evaluating CASE becomes a bottleneck


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #2064: DivideByZero for CaseWhen

Posted by GitBox <gi...@apache.org>.
alamb closed issue #2064:
URL: https://github.com/apache/arrow-datafusion/issues/2064


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org