You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/31 11:30:30 UTC

[GitHub] [arrow] alamb commented on a change in pull request #9359: ARROW-11426: [Rust][DataFusion] EXTRACT support

alamb commented on a change in pull request #9359:
URL: https://github.com/apache/arrow/pull/9359#discussion_r567411606



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1964,6 +1964,20 @@ async fn crypto_expressions() -> Result<()> {
     Ok(())
 }
 
+#[tokio::test]
+async fn extract_date_part() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    let sql = "SELECT
+        EXTRACT(HOUR FROM CAST('2020-01-01' AS DATE)),
+        EXTRACT(HOUR FROM to_timestamp('2020-09-08T12:00:00+00:00'))

Review comment:
       I think adding coverage for the other date parts would be valuable here (even if they error)

##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -169,6 +170,13 @@ pub enum Expr {
     },
     /// Represents a reference to all fields in a schema.
     Wildcard,
+    /// Extract date parts (day, hour, minute) from a date / time expression
+    Extract {

Review comment:
       I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!
   
   In terms of adding variants to `Expr` I think it will need be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE). 
   
   So in this case, given @jorgecarleitao is cranking along with #9376 it seems like perhaps this PR should perhaps try and translate `EXTRACT` into a function.

##########
File path: rust/datafusion/src/physical_plan/expressions/extract.rs
##########
@@ -0,0 +1,115 @@
+use crate::error::Result;
+use core::fmt;
+use std::{any::Any, sync::Arc};
+
+use arrow::{
+    array::{
+        Date32Array, Date64Array, TimestampMicrosecondArray, TimestampMillisecondArray,
+        TimestampNanosecondArray, TimestampSecondArray,
+    },
+    compute::hour,
+    datatypes::{DataType, Schema, TimeUnit},
+    record_batch::RecordBatch,
+};
+
+use crate::{
+    error::DataFusionError,
+    logical_plan::DatePart,
+    physical_plan::{ColumnarValue, PhysicalExpr},
+};
+
+impl fmt::Display for Extract {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        write!(f, "EXTRACT({} AS {:?})", self.date_part, self.expr)
+    }
+}
+
+/// Extract
+#[derive(Debug)]
+pub struct Extract {
+    date_part: DatePart,
+    expr: Arc<dyn PhysicalExpr>,
+}
+
+impl Extract {
+    /// Create new Extract expression
+    pub fn new(date_part: DatePart, expr: Arc<dyn PhysicalExpr>) -> Self {
+        Self { date_part, expr }
+    }
+}
+
+impl PhysicalExpr for Extract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn data_type(&self, _input_schema: &Schema) -> Result<DataType> {
+        Ok(DataType::Int32)
+    }
+
+    fn nullable(&self, input_schema: &Schema) -> Result<bool> {
+        self.expr.nullable(input_schema)
+    }
+
+    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
+        let value = self.expr.evaluate(batch)?;
+        let data_type = value.data_type();
+        let array = match value {
+            ColumnarValue::Array(array) => array,
+            ColumnarValue::Scalar(scalar) => scalar.to_array(),
+        };
+
+        match data_type {
+            DataType::Date32 => {
+                let array = array.as_any().downcast_ref::<Date32Array>().unwrap();
+                Ok(ColumnarValue::Array(Arc::new(hour(array)?)))

Review comment:
       I find it confusing that `date_part` is passed all the way down in the Exprs / trees only to be ignored in the actual implementation which directly calls `hour. I can see that the expr seems to always be made with `DatePart::Hour` but I am not 100% sure. 
   
   I am fine with not supporting all the various date parts initially, but I would recommend the following as a way of documenting through code / safe guard against future bugs:
   
   1. Add a test for `EXTRACT DAY from timestamp` and show that it generates a useful error
   2. Add a check in this function for `date_part != DataPart::Hour` and throw an error




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org