You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/10/17 01:03:33 UTC

[PR] Change input for `to_timestamp` function [arrow-datafusion]

comphead opened a new pull request, #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #7802.
   
   ## Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   `to_timestamp()` builtin function had an input in nanoseconds whereas other SQL engines has input in seconds and as the result the function output gives unexpected result.
    
   
   ## What changes are included in this PR?
   Changed signature to `to_timestamp` function from nanoseconds to seconds, also introduced `to_timestamp_nano` to create timestamp value from nanoseconds
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ## Are these changes tested?
   Yes
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ## Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   The user might expect that `to_timestamp` function will return other values if the input still in nanoseconds


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#issuecomment-1779687674

   I think so @comphead  -- it has had plenty of review by now and no objections. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#issuecomment-1770533939

   > To align with Postgrseql, perhaps we could fire another ticket for supporting more inputs like floating number & string.
   
   Great idea @waitingkuo.  This is now tracked by two tickets:
   * https://github.com/apache/arrow-datafusion/issues/7868
   * https://github.com/apache/arrow-datafusion/issues/5398


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function [arrow-datafusion]

Posted by "waitingkuo (via GitHub)" <gi...@apache.org>.
waitingkuo commented on PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#issuecomment-1770268498

   Thank you @comphead @alamb 
   
   This looks great, this closes #2979 as well.
   
   To align with Postgrseql, perhaps we could fire another ticket for supporting more inputs like floating number & string.
   i.e. in postgresql
   ```bash
   willy=# select to_timestamp('1.1');
          to_timestamp       
   --------------------------
    1970-01-01 08:00:01.1+08
   (1 row)
   
   willy=# select to_timestamp(1.1);
          to_timestamp       
   --------------------------
    1970-01-01 08:00:01.1+08
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#issuecomment-1778329935

   @alamb should we merge the PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "waitingkuo (via GitHub)" <gi...@apache.org>.
waitingkuo commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1367633556


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -748,6 +751,7 @@ impl BuiltinScalarFunction {
             BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),

Review Comment:
   @alamb  @comphead 
   any idea why the function  could return `Timestamp(Second, None)` even though we set the return type here as `Timestamp(Nanosecond, None)`?



##########
datafusion/physical-expr/src/functions.rs:
##########
@@ -74,15 +74,20 @@ pub fn create_physical_expr(
         // so we don't have to pay a per-array/batch cost.
         BuiltinScalarFunction::ToTimestamp => {
             Arc::new(match input_phy_exprs[0].data_type(input_schema) {
-                Ok(DataType::Int64) | Ok(DataType::Timestamp(_, None)) => {
-                    |col_values: &[ColumnarValue]| {
-                        cast_column(
-                            &col_values[0],
-                            &DataType::Timestamp(TimeUnit::Nanosecond, None),
-                            None,
-                        )
-                    }
-                }
+                Ok(DataType::Int64) => |col_values: &[ColumnarValue]| {
+                    cast_column(
+                        &col_values[0],
+                        &DataType::Timestamp(TimeUnit::Second, None),
+                        None,
+                    )
+                },

Review Comment:
   it still returns Timestamp(second, None) while the input is Int64
   ```bash
   ❯ select arrow_typeof(to_timestamp(0));
   +--------------------------------------+
   | arrow_typeof(to_timestamp(Int64(0))) |
   +--------------------------------------+
   | Timestamp(Second, None)              |
   +--------------------------------------+
   1 row in set. Query took 0.003 seconds.
   ```



##########
datafusion/physical-expr/src/functions.rs:
##########
@@ -74,15 +74,20 @@ pub fn create_physical_expr(
         // so we don't have to pay a per-array/batch cost.
         BuiltinScalarFunction::ToTimestamp => {
             Arc::new(match input_phy_exprs[0].data_type(input_schema) {
-                Ok(DataType::Int64) | Ok(DataType::Timestamp(_, None)) => {
-                    |col_values: &[ColumnarValue]| {
-                        cast_column(
-                            &col_values[0],
-                            &DataType::Timestamp(TimeUnit::Nanosecond, None),
-                            None,
-                        )
-                    }
-                }
+                Ok(DataType::Int64) => |col_values: &[ColumnarValue]| {
+                    cast_column(
+                        &col_values[0],
+                        &DataType::Timestamp(TimeUnit::Second, None),
+                        None,
+                    )
+                },

Review Comment:
   it still returns Timestamp(second, None) while the input is Int64
   ```bash
   ❯ select arrow_typeof(to_timestamp(0));
   +--------------------------------------+
   | arrow_typeof(to_timestamp(Int64(0))) |
   +--------------------------------------+
   | Timestamp(Second, None)              |
   +--------------------------------------+
   1 row in set. Query took 0.003 seconds.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#issuecomment-1767415271

   cc @gruuya  @Dandandan and @waitingkuo and @waynexia  as other people I know who might be using the timestamp functionality of DataFusion and be affected by this change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function [arrow-datafusion]

Posted by "mhilton (via GitHub)" <gi...@apache.org>.
mhilton commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1363741843


##########
docs/source/user-guide/sql/scalar_functions.md:
##########
@@ -1390,10 +1391,10 @@ extract(field FROM source)
 
 ### `to_timestamp`
 
-Converts a value to RFC3339 nanosecond timestamp format (`YYYY-MM-DDT00:00:00.000000000Z`).
+Converts a value to RFC3339 second timestamp format (`YYYY-MM-DDT00:00:00Z`).
 Supports timestamp, integer, and unsigned integer types as input.
 Integers and unsigned integers are parsed as Unix nanosecond timestamps and

Review Comment:
   Is this documentation still accurate? Integers are now parsed as "unix second timestamps"?. Unless they aren't in which case a little bit more expanation of the behaviour for integers would be helpful.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1367939268


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -748,6 +751,7 @@ impl BuiltinScalarFunction {
             BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
             BuiltinScalarFunction::ToTimestampMillis => Ok(Timestamp(Millisecond, None)),
             BuiltinScalarFunction::ToTimestampMicros => Ok(Timestamp(Microsecond, None)),
+            BuiltinScalarFunction::ToTimestampNanos => Ok(Timestamp(Nanosecond, None)),

Review Comment:
   We  could map `to_timestamp` to `ToTimestampSeconds` as well durig planing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1365918819


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -745,9 +748,10 @@ impl BuiltinScalarFunction {
                     return plan_err!("The to_hex function can only accept integers.");
                 }
             }),
-            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
+            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Second, None)),

Review Comment:
   In PG 
   ```
   select pg_typeof(to_timestamp(1.234))
   ```
   the return type is `timestamp with time zone` which has minimum resolution as a microsecond
   
   https://www.postgresql.org/docs/current/datatype-datetime.html
   
   In light of this, having the return type as nanos or micros makes more sense for me.
   What do you think is better? microseconds like in PG or nanos like it was in DF before the PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1367672783


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -748,6 +751,7 @@ impl BuiltinScalarFunction {
             BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),

Review Comment:
   Thanks @waitingkuo for raising it.
   I suppose for bunch of functions the return type gets changed  in `create_physical_expr` because `functions.rs` is on physical expr layer and rewrites the return type defined on logical layer. 
   
   For ints I would think the second is probably better return type
   If you try to present 0004-01-01 year in nanoseconds it will require higher value than `i64::MAX/MIN`
   https://doc.rust-lang.org/src/core/num/mod.rs.html#366
   
   ```
   make_type!(
       TimestampNanosecondType,
       i64,
       DataType::Timestamp(TimeUnit::Nanosecond, None),
       "A timestamp nanosecond type with an optional timezone."
   );
   ```
   But you are right we need to clean up builtin functions configuration in multiple places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove merged PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1365993550


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -745,9 +748,10 @@ impl BuiltinScalarFunction {
                     return plan_err!("The to_hex function can only accept integers.");
                 }
             }),
-            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
+            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Second, None)),

Review Comment:
   I don't have any preference now that we have coercion rules that can automatically coerce between timestamp resolution



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1367406181


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -745,9 +748,10 @@ impl BuiltinScalarFunction {
                     return plan_err!("The to_hex function can only accept integers.");
                 }
             }),
-            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
+            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Second, None)),

Review Comment:
   set return type to nanos as it was before. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "waitingkuo (via GitHub)" <gi...@apache.org>.
waitingkuo commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1365606831


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -745,9 +748,10 @@ impl BuiltinScalarFunction {
                     return plan_err!("The to_hex function can only accept integers.");
                 }
             }),
-            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
+            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Second, None)),

Review Comment:
   Return `Timestamp(Nanosecond, None)` seems reasonable to me since most of the time related functions in datafusion return nanosecond based timestamp i.e. `timestamp '2000-01-01T00:00:00'`, `now()`, ...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1366053076


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -745,9 +748,10 @@ impl BuiltinScalarFunction {
                     return plan_err!("The to_hex function can only accept integers.");
                 }
             }),
-            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
+            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Second, None)),

Review Comment:
   got it, I'll do micros as a return type as it is in PG, will update the PR soon



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1365284753


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -745,9 +748,10 @@ impl BuiltinScalarFunction {
                     return plan_err!("The to_hex function can only accept integers.");
                 }
             }),
-            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Nanosecond, None)),
+            BuiltinScalarFunction::ToTimestamp => Ok(Timestamp(Second, None)),

Review Comment:
   while writing up https://github.com/apache/arrow-datafusion/issues/7868 it occured to me this signature prevents implementing something like `to_timestamp(1.234)` reasonably, as it can't represent sub-second precison
   
   What would you think about leaving the return type the same (`Timestamp(Nanosecond, None)`) so that we can add support for floating point argments as a follow on?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Change input for `to_timestamp` function [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7844:
URL: https://github.com/apache/arrow-datafusion/pull/7844#discussion_r1362721442


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1534,10 +1537,10 @@ mod tests {
         let expr = cast_to_int64_expr(now_expr()) + lit(100_i64);
         test_evaluate_with_start_time(expr, lit(ts_nanos + 100), &time);
 
-        //  CAST(now() as int64) < cast(to_timestamp(...) as int64) + 50000_i64 ---> true
+        //  CAST(now() as int64) < cast(to_timestamp(...) as int64) + 50000_i64 ---> false
         let expr = cast_to_int64_expr(now_expr())
             .lt(cast_to_int64_expr(to_timestamp_expr(ts_string)) + lit(50000i64));

Review Comment:
   maybe we can adjust the literal so it still evaluates to true so that we don't lose coverage 🤔 



##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1534,10 +1537,10 @@ mod tests {
         let expr = cast_to_int64_expr(now_expr()) + lit(100_i64);
         test_evaluate_with_start_time(expr, lit(ts_nanos + 100), &time);
 
-        //  CAST(now() as int64) < cast(to_timestamp(...) as int64) + 50000_i64 ---> true
+        //  CAST(now() as int64) < cast(to_timestamp(...) as int64) + 50000_i64 ---> false
         let expr = cast_to_int64_expr(now_expr())
             .lt(cast_to_int64_expr(to_timestamp_expr(ts_string)) + lit(50000i64));

Review Comment:
   maybe we can adjust the literal so it still evaluates to true so that we don't lose coverage 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org