You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/11 12:08:30 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #6632: Fix date_trunc signature

alamb opened a new pull request, #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/6623
   
   
   # Rationale for this change
   
   The declared output type for `date_trunc` is wrong -- it is always `Timestamp(Nanosecond)` but it was declared to return the same type as its input
   
   # What changes are included in this PR?
   Make signature match the actual output type
   
   # Are these changes tested?
   Yes, 
   
   
   # Are there any user-facing changes?
   (fewer queries will error with a bug)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#discussion_r1225793821


##########
datafusion/expr/src/function.rs:
##########
@@ -218,7 +218,9 @@ pub fn return_type(
         BuiltinScalarFunction::Concat => Ok(DataType::Utf8),
         BuiltinScalarFunction::ConcatWithSeparator => Ok(DataType::Utf8),
         BuiltinScalarFunction::DatePart => Ok(DataType::Float64),
-        BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {
+        // DateTrunc always makes nanosecond timestamps
+        BuiltinScalarFunction::DateTrunc => Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)),

Review Comment:
   here is the actual bugfix 
   
   I think it is a reasonable question if date_trunc should be changed to return the same type as its input, but the current implementation always creates nanosecond timestamps;
   
   https://github.com/alamb/arrow-datafusion/blob/main/datafusion/physical-expr/src/datetime_expressions.rs#L266



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on pull request #6632: Fix date_trunc signature

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#issuecomment-1586347255

   > After reviewing #6638 from @viirya I am not sure about this PR anymore -- it seems like this PR simply makes the signature match for the array versions but then it would be inconsistent with the scalar versions...
   > 
   > I think any real fix should have the array and scalar implementations match and of course the signature match 🤔
   
   Hmm, I think this should be correct. As I mentioned in https://github.com/apache/arrow-datafusion/pull/6632#discussion_r1225918413, actually the second argument (no matter scalar or array) should be `Timestamp(Nanosecond, None)` based on the function signature. Unless that we want to make date_trunc accepts all timestamp types instead of just nanosecond timestamp.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#discussion_r1225793821


##########
datafusion/expr/src/function.rs:
##########
@@ -218,7 +218,9 @@ pub fn return_type(
         BuiltinScalarFunction::Concat => Ok(DataType::Utf8),
         BuiltinScalarFunction::ConcatWithSeparator => Ok(DataType::Utf8),
         BuiltinScalarFunction::DatePart => Ok(DataType::Float64),
-        BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {
+        // DateTrunc always makes nanosecond timestamps
+        BuiltinScalarFunction::DateTrunc => Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)),

Review Comment:
   This change makes the declared types consistent with the actual implementation
   
   https://github.com/alamb/arrow-datafusion/blob/main/datafusion/physical-expr/src/datetime_expressions.rs#L266
   
   I think it is a reasonable question if `date_trunc` should be changed to return the same type as its input, but the current implementation always creates nanosecond timestamps
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#issuecomment-1587835528

   Filed https://github.com/apache/arrow-datafusion/issues/6653 to track improvements to `date_trunc`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#issuecomment-1586335712

   After reviewing https://github.com/apache/arrow-datafusion/pull/6638 from @viirya  I am not sure about this PR anymore -- it seems like this PR simply makes the signature match for the array versions but then it would be inconsistent with the scalar versions...
   
   I think any real fix should have the array and scalar implementations match and of course the signature match 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#issuecomment-1586352059

   > Unless that we want to make date_trunc accepts all timestamp types instead of just nanosecond timestamp, and returns same output type as input type.
   
   Yeah, this is what I was thinking. This might have the benefit of supporting larger timestamp ranges as  `timestamp(second, ...)` can hold a larger range
   
    But I suppose since we already have fixes lined up that standardize on always generating `timestamp(nanosecond, None)` that seems better than what is on master. 
   
   cc @waitingkuo  in case you have some opinions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#discussion_r1225793420


##########
datafusion/core/tests/sqllogictests/test_files/timestamps.slt:
##########
@@ -15,6 +15,38 @@
 # specific language governing permissions and limitations
 # under the License.
 
+##########

Review Comment:
   I just moved this table definition to the top of this file so I could reuse it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6632: Fix date_trunc signature

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#discussion_r1226162464


##########
datafusion/expr/src/function.rs:
##########
@@ -221,17 +221,17 @@ pub fn return_type(
         BuiltinScalarFunction::Concat => Ok(Utf8),
         BuiltinScalarFunction::ConcatWithSeparator => Ok(Utf8),
         BuiltinScalarFunction::DatePart => Ok(Float64),
-        BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {
-            match input_expr_types[1] {
-                Timestamp(Nanosecond, _) | Utf8 => Ok(Timestamp(Nanosecond, None)),
-                Timestamp(Microsecond, _) => Ok(Timestamp(Microsecond, None)),
-                Timestamp(Millisecond, _) => Ok(Timestamp(Millisecond, None)),
-                Timestamp(Second, _) => Ok(Timestamp(Second, None)),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {fun} function can only accept timestamp as the second arg."
-                ))),
-            }
-        }
+        // DateTrunc always makes nanosecond timestamps
+        BuiltinScalarFunction::DateTrunc => Ok(Timestamp(Nanosecond, None)),

Review Comment:
   > So it doesn't cause issue before?
   
   Because it's hidden by other BUG #6613 (from_plan create projection use original schema)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waitingkuo commented on pull request #6632: Fix date_trunc signature

Posted by "waitingkuo (via GitHub)" <gi...@apache.org>.
waitingkuo commented on PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#issuecomment-1586877278

   hi @alamb thank you for pining me
   
   https://github.com/apache/arrow-datafusion/blob/373e291faba5b3c722018367e01a6b59a91bddee/datafusion/physical-expr/src/datetime_expressions.rs#L286-L317
   
   for angularity minute, second, and millisecond, it returns second, millisecond, and microsecond
   
   ```bash
   ❯ select arrow_typeof(date_trunc('minute', '2001-01-01T00:00:00.123456789'::timestamp));
   +--------------------------------------------------------------------------------+
   | arrow_typeof(date_trunc(Utf8("minute"),Utf8("2001-01-01T00:00:00.123456789"))) |
   +--------------------------------------------------------------------------------+
   | Timestamp(Second, None)                                                        |
   +--------------------------------------------------------------------------------+
   1 row in set. Query took 0.003 seconds.
   ```
   
   ```bash
   ❯ select arrow_typeof(date_trunc('second', '2001-01-01T00:00:00.123456789'::timestamp));
   +--------------------------------------------------------------------------------+
   | arrow_typeof(date_trunc(Utf8("second"),Utf8("2001-01-01T00:00:00.123456789"))) |
   +--------------------------------------------------------------------------------+
   | Timestamp(Millisecond, None)                                                   |
   +--------------------------------------------------------------------------------+
   1 row in set. Query took 0.003 seconds.
   ```
   
   ```bash
   ❯ select arrow_typeof(date_trunc('millisecond', '2001-01-01T00:00:00.123456789'::timestamp));
   +-------------------------------------------------------------------------------------+
   | arrow_typeof(date_trunc(Utf8("millisecond"),Utf8("2001-01-01T00:00:00.123456789"))) |
   +-------------------------------------------------------------------------------------+
   | Timestamp(Microsecond, None)                                                        |
   +-------------------------------------------------------------------------------------+
   1 row in set. Query took 0.003 seconds.
   ```
   
   for the array implementation it returns error since
   
   ```bash
   ❯ select date_trunc('minute', a) from (select '2001-01-01T00:00:00'::timestamp as a) union (select '2001-01-01T00:00:01');
   External error: Arrow error: Invalid argument error: RowConverter column schema mismatch, expected Timestamp(Second, None) got Timestamp(Nanosecond, None)
   ```
   
   ```bash
   ❯ select date_trunc('second', a) from (select '2001-01-01T00:00:00'::timestamp as a) union (select '2001-01-01T00:00:01');
   External error: Arrow error: Invalid argument error: RowConverter column schema mismatch, expected Timestamp(Millisecond, None) got Timestamp(Nanosecond, None)
   ```
   
   ```bash
   ❯ select date_trunc('millisecond', a) from (select '2001-01-01T00:00:00'::timestamp as a) union (select '2001-01-01T00:00:01');
   External error: Arrow error: Invalid argument error: RowConverter column schema mismatch, expected Timestamp(Microsecond, None) got Timestamp(Nanosecond, None)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #6632: Fix date_trunc signature

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#discussion_r1225918413


##########
datafusion/expr/src/function.rs:
##########
@@ -221,17 +221,17 @@ pub fn return_type(
         BuiltinScalarFunction::Concat => Ok(Utf8),
         BuiltinScalarFunction::ConcatWithSeparator => Ok(Utf8),
         BuiltinScalarFunction::DatePart => Ok(Float64),
-        BuiltinScalarFunction::DateTrunc | BuiltinScalarFunction::DateBin => {
-            match input_expr_types[1] {
-                Timestamp(Nanosecond, _) | Utf8 => Ok(Timestamp(Nanosecond, None)),
-                Timestamp(Microsecond, _) => Ok(Timestamp(Microsecond, None)),
-                Timestamp(Millisecond, _) => Ok(Timestamp(Millisecond, None)),
-                Timestamp(Second, _) => Ok(Timestamp(Second, None)),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {fun} function can only accept timestamp as the second arg."
-                ))),
-            }
-        }
+        // DateTrunc always makes nanosecond timestamps
+        BuiltinScalarFunction::DateTrunc => Ok(Timestamp(Nanosecond, None)),

Review Comment:
   I guess that it is before because the second argument in `DateTrunc` signature is `Timestamp(Nanosecond, None)`. So it doesn't cause issue before?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6632: Fix date_trunc signature

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6632:
URL: https://github.com/apache/arrow-datafusion/pull/6632#issuecomment-1587335750

   > for the array implementation it returns error since
   
   Thanks @waitingkuo -- I think this PR fixes the errors you show and 
   
   My plan here is to merge this PR and https://github.com/apache/arrow-datafusion/pull/6638 and then I will file a follow on ticket about potentially changing the implementation to support more granularity in `date_trunc`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org