You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/09 21:01:55 UTC

[GitHub] [arrow] alamb opened a new pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

alamb opened a new pull request #9970:
URL: https://github.com/apache/arrow/pull/9970


   # Rationale:
   If you try and aggregate (via SUM, for example) a column of a timestamp type, DataFusion generates an error:
   ```
   Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.
   ```
   
   For example, from IOx
   
   ```
   > show columns from t;
   +---------------+--------------+------------+-------------+-----------------------------+-------------+
   | table_catalog | table_schema | table_name | column_name | data_type                   | is_nullable |
   +---------------+--------------+------------+-------------+-----------------------------+-------------+
   | datafusion    | public       | t          | a           | Utf8                        | NO          |
   | datafusion    | public       | t          | b           | Timestamp(Nanosecond, None) | NO          |
   +---------------+--------------+------------+-------------+-----------------------------+-------------+
   2 row in set. Query took 0 seconds.
   > select sum(b) from t;
   Plan("Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.")
   {code}
   
   # Changes:
   Add support for aggregating timestamp types and tests for same
   
   # Notes
   Note this is follow on / more fleshing out of the work done in #9773 by @velvia (👋  thanks for adding Timestamps to `ScalarValue`)
   
   I will track supporting AVG on timestamps in another ticket. It is more involved (as currently Avg assumes the output type is always F64), and not important for myuse casee at the moment.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817118938


   Hey @alamb this is looking good.
   I'm wondering whether we should support sum/avg for timestamp?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817279890


   > What are the exact use cases for summing timestamps? When does it make sense?
   
   @Dandandan  that is an excellent question. I will freely admit I was just heads down trying to get all the aggregates working rather than thinking if I *should* be working. I can't think of any time that `sum(timestamp)` really makes sense to be honest. 
   
   Do you think I should remove support for summing timestamps?
   
   As for AVG, I added a note to https://issues.apache.org/jira/browse/ARROW-12318 with your observation. 🤔 good question
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817946446


   > > At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.
   > 
   > @Dandandan -- I just double checked and the IOx project actually has need of `Sum` (and `Avg`) for timestamps due to the semantics of the `Flux` and `InfluxQL` languages. I can implement this functionality for custom user defined aggregates if needed, but I would prefer having the functionality in DataFusion directly so that anyone else can potentially use it too
   > 
   > I don't really understand the comment about "absolute zero" for timestamps.
   > 
   > Would it be ok with you if I merged in support for Sum for timestamps?
   
   My reaction to enable sum for timestamps was that it depends on the starting/zero value for the value, in this case the UNIX epoch.
   The result of a addition/sum on timestamp is meaningless as it depends on the choice of the value of zero (1980 + 1980 + 1980 = 2000!?, 1970 + 1970 = 1970!?). When we would change the zero value (to Jan 1 1900 for example), we would totally change the meaning of adding timestamps.
   
   If there was an "actual" zero date (like Kelvin for example) summing the values makes sense but there isn't such a thing for timestamps of course.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817915042


   > At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.
   
   @Dandandan -- I just double checked and the IOx project actually has need of `Sum` (and `Avg`) for timestamps due to the semantics of the `Flux` and `InfluxQL` languages.  I can implement this functionality for custom user defined aggregates if needed, but I would prefer having the functionality in DataFusion directly so that anyone else can potentially use it too 
   
   I don't really understand the comment about "absolute zero" for timestamps.
   
   Would it be ok with you if I merged in support for Sum for timestamps?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r612351047



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,121 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap_err();
+
+        assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.");
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_count() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT count(nanos), count(micros), count(millis), count(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![

Review comment:
       Thanks @velvia  --In general I agree that checking string representations in tests is brittle when formatting changes.
   
   In the case of sql / tabular output, I think it is the best alternative I have seen because:
   1. The output formatting does not change often
   2. Updating these tests are easy (simply copy/paste the output of the test failure into the test after verifying it)
   3. These tests follow the same pattern as the rest of the tests in this module
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817915042


   > At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.
   
   @Dandandan -- I just double checked and the IOx project actually has need of `Sum` (and `Avg`) for timestamps due to the semantics of the `Flux` and `InfluxQL` languages.  I can implement this functionality for custom user defined aggregates if needed, but I would prefer having the functionality in DataFusion directly so that anyone else can potentially use it too 
   
   I don't really understand the comment about "absolute zero" for timestamps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] velvia commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
velvia commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r611954223



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,128 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![

Review comment:
       That's a good question.  I can think of great uses for avg, and difference (latency between two timestamps), but not really for sum, other than that sum is an important part of computing average?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817874943


   > At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.
   
   
   will do


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r611030976



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,128 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![

Review comment:
       What are the exact use cases for summing timestamps? When does it make sense?
   It looks like PostgreSQL doesn't support it: https://www.postgresql.org/docs/13/functions-aggregate.html




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r610907584



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,128 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+-------------------------------+----------------------------+-------------------------+---------------------+",
+            "| SUM(nanos)                    | SUM(micros)                | SUM(millis)             | SUM(secs)           |",
+            "+-------------------------------+----------------------------+-------------------------+---------------------+",
+            "| 2111-10-27 09:35:30.566825885 | 2111-10-27 09:35:30.566825 | 2111-10-27 09:35:30.566 | 2111-10-27 09:35:30 |",
+            "+-------------------------------+----------------------------+-------------------------+---------------------+",
+];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_count() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT count(nanos), count(micros), count(millis), count(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+--------------+---------------+---------------+-------------+",
+            "| COUNT(nanos) | COUNT(micros) | COUNT(millis) | COUNT(secs) |",
+            "+--------------+---------------+---------------+-------------+",
+            "| 3            | 3             | 3             | 3           |",
+            "+--------------+---------------+---------------+-------------+",
+        ];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_min() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT min(nanos), min(micros), min(millis), min(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+----------------------------+----------------------------+-------------------------+---------------------+",
+            "| MIN(nanos)                 | MIN(micros)                | MIN(millis)             | MIN(secs)           |",
+            "+----------------------------+----------------------------+-------------------------+---------------------+",
+            "| 2011-12-13 11:13:10.123450 | 2011-12-13 11:13:10.123450 | 2011-12-13 11:13:10.123 | 2011-12-13 11:13:10 |",
+            "+----------------------------+----------------------------+-------------------------+---------------------+",
+        ];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_max() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT max(nanos), max(micros), max(millis), max(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+-------------------------+-------------------------+-------------------------+---------------------+",
+            "| MAX(nanos)              | MAX(micros)             | MAX(millis)             | MAX(secs)           |",
+            "+-------------------------+-------------------------+-------------------------+---------------------+",
+            "| 2021-01-01 05:11:10.432 | 2021-01-01 05:11:10.432 | 2021-01-01 05:11:10.432 | 2021-01-01 05:11:10 |",
+            "+-------------------------+-------------------------+-------------------------+---------------------+",
+];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_avg() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT avg(nanos), avg(micros), avg(millis), avg(secs) FROM t",
+        )
+        .await
+        .unwrap_err();
+
+        assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.");

Review comment:
       This error message is pretty horrible. I filed https://issues.apache.org/jira/browse/ARROW-12319 to track improving it




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] velvia commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
velvia commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r611955641



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,121 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap_err();
+
+        assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.");
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_count() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT count(nanos), count(micros), count(millis), count(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![

Review comment:
       Testing string formatting seems kinda yucky and prone to fail when the formatting changes.  Is there a better way to test the output here?

##########
File path: rust/datafusion/src/test/mod.rs
##########
@@ -182,6 +185,93 @@ pub fn make_partition(sz: i32) -> RecordBatch {
     RecordBatch::try_new(schema, vec![arr]).unwrap()
 }
 
+/// Return a new table provider containing all of the supported timestamp types
+pub fn table_with_timestamps() -> Arc<dyn TableProvider> {
+    let batch = make_timestamps();
+    let schema = batch.schema();
+    let partitions = vec![vec![batch]];
+    Arc::new(MemTable::try_new(schema, partitions).unwrap())
+}
+
+/// Return  record batch with all of the supported timestamp types
+/// values
+///
+/// Columns are named:
+/// "nanos" --> TimestampNanosecondArray
+/// "micros" --> TimestampMicrosecondArray
+/// "millis" --> TimestampMillisecondArray
+/// "secs" --> TimestampSecondArray
+/// "names" --> StringArray
+pub fn make_timestamps() -> RecordBatch {

Review comment:
       👍   this will be really useful.

##########
File path: rust/datafusion/src/scalar.rs
##########
@@ -72,12 +77,14 @@ pub enum ScalarValue {
     Date32(Option<i32>),
     /// Date stored as a signed 64bit int
     Date64(Option<i64>),
+    /// Timestamp Second
+    TimestampSecond(Option<i64>),

Review comment:
       👍  Great to be consistent.
   Not related to this PR necessarily but I'm going to work on support for converting date strings to timestamps of different resolutions.  Right now only Nanos are supported, ideally we'd have `to_timestamp(...)` that allows choosing nanos, micros, millis, or seconds under the hood.

##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,121 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap_err();
+
+        assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.");
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_count() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT count(nanos), count(micros), count(millis), count(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+--------------+---------------+---------------+-------------+",
+            "| COUNT(nanos) | COUNT(micros) | COUNT(millis) | COUNT(secs) |",
+            "+--------------+---------------+---------------+-------------+",
+            "| 3            | 3             | 3             | 3           |",
+            "+--------------+---------------+---------------+-------------+",
+        ];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_min() -> Result<()> {

Review comment:
       Min and max definitely makes sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817946446






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817979546


   I have removed `SUM` from this PR and will figure out what IOx is doing that seems to require sum and doesn't make sense. Thanks @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817113666


   FYI @Dandandan  / @returnString 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817317167


   > > What are the exact use cases for summing timestamps? When does it make sense?
   > 
   > @Dandandan that is an excellent question. I will freely admit I was just heads down trying to get all the aggregates working rather than thinking if I _should_ be working. I can't think of any time that `sum(timestamp)` really makes sense to be honest.
   > 
   > Do you think I should remove support for summing timestamps?
   > 
   > As for AVG, I added a note to https://issues.apache.org/jira/browse/ARROW-12318 with your observation. 🤔 good question
   
   At least would be best to remove support forsum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] velvia commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
velvia commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r611953748



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -1403,6 +1403,128 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn aggregate_timestamps_sum() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT sum(nanos), sum(micros), sum(millis), sum(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+-------------------------------+----------------------------+-------------------------+---------------------+",
+            "| SUM(nanos)                    | SUM(micros)                | SUM(millis)             | SUM(secs)           |",
+            "+-------------------------------+----------------------------+-------------------------+---------------------+",
+            "| 2111-10-27 09:35:30.566825885 | 2111-10-27 09:35:30.566825 | 2111-10-27 09:35:30.566 | 2111-10-27 09:35:30 |",
+            "+-------------------------------+----------------------------+-------------------------+---------------------+",
+];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_count() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT count(nanos), count(micros), count(millis), count(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+--------------+---------------+---------------+-------------+",
+            "| COUNT(nanos) | COUNT(micros) | COUNT(millis) | COUNT(secs) |",
+            "+--------------+---------------+---------------+-------------+",
+            "| 3            | 3             | 3             | 3           |",
+            "+--------------+---------------+---------------+-------------+",
+        ];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_min() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT min(nanos), min(micros), min(millis), min(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+----------------------------+----------------------------+-------------------------+---------------------+",
+            "| MIN(nanos)                 | MIN(micros)                | MIN(millis)             | MIN(secs)           |",
+            "+----------------------------+----------------------------+-------------------------+---------------------+",
+            "| 2011-12-13 11:13:10.123450 | 2011-12-13 11:13:10.123450 | 2011-12-13 11:13:10.123 | 2011-12-13 11:13:10 |",
+            "+----------------------------+----------------------------+-------------------------+---------------------+",
+        ];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_max() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT max(nanos), max(micros), max(millis), max(secs) FROM t",
+        )
+        .await
+        .unwrap();
+
+        let expected = vec![
+            "+-------------------------+-------------------------+-------------------------+---------------------+",
+            "| MAX(nanos)              | MAX(micros)             | MAX(millis)             | MAX(secs)           |",
+            "+-------------------------+-------------------------+-------------------------+---------------------+",
+            "| 2021-01-01 05:11:10.432 | 2021-01-01 05:11:10.432 | 2021-01-01 05:11:10.432 | 2021-01-01 05:11:10 |",
+            "+-------------------------+-------------------------+-------------------------+---------------------+",
+];
+        assert_batches_sorted_eq!(expected, &results);
+
+        Ok(())
+    }
+
+    #[tokio::test]
+    async fn aggregate_timestamps_avg() -> Result<()> {
+        let tmp_dir = TempDir::new()?;
+        let mut ctx = create_ctx(&tmp_dir, 1)?;
+        ctx.register_table("t", test::table_with_timestamps())
+            .unwrap();
+
+        let results = plan_and_collect(
+            &mut ctx,
+            "SELECT avg(nanos), avg(micros), avg(millis), avg(secs) FROM t",
+        )
+        .await
+        .unwrap_err();
+
+        assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.");

Review comment:
       Agreed, thanks for filing a ticket!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#discussion_r610904035



##########
File path: rust/datafusion/src/scalar.rs
##########
@@ -72,12 +77,14 @@ pub enum ScalarValue {
     Date32(Option<i32>),
     /// Date stored as a signed 64bit int
     Date64(Option<i64>),
+    /// Timestamp Second
+    TimestampSecond(Option<i64>),

Review comment:
       I added support for `TimestampSecond` here and renamed `ScalarValue::TimeMillisecond` --> `ScalarValue::TimestampMillisecond` so that the `ScalarValue` enum type matches up with the `DataType` enum type name




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-816969912


   https://issues.apache.org/jira/browse/ARROW-12277


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817946446


   > > At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.
   > 
   > @Dandandan -- I just double checked and the IOx project actually has need of `Sum` (and `Avg`) for timestamps due to the semantics of the `Flux` and `InfluxQL` languages. I can implement this functionality for custom user defined aggregates if needed, but I would prefer having the functionality in DataFusion directly so that anyone else can potentially use it too
   > 
   > I don't really understand the comment about "absolute zero" for timestamps.
   > 
   > Would it be ok with you if I merged in support for Sum for timestamps?
   
   My reaction to enable sum for timestamps was that it depends on the starting/zero value for the value, in this case the UNIX epoch.
   The result of a addition/sum on timestamp is meaningless as it depends on the choice of the value of zero (1980 + 1980 + 1980 = 2000!?, 1970 + 1970 = 1970!?, 1960 + 1960 = 1950!?). When we would change the zero value (to Jan 1 1900 for example), we would totally change the meaning of adding timestamps.
   
   If there was an "actual" zero date (like Kelvin for example) summing the values makes sense but there isn't such a thing for timestamps of course.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817946446


   > > At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.
   > 
   > @Dandandan -- I just double checked and the IOx project actually has need of `Sum` (and `Avg`) for timestamps due to the semantics of the `Flux` and `InfluxQL` languages. I can implement this functionality for custom user defined aggregates if needed, but I would prefer having the functionality in DataFusion directly so that anyone else can potentially use it too
   > 
   > I don't really understand the comment about "absolute zero" for timestamps.
   > 
   > Would it be ok with you if I merged in support for Sum for timestamps?
   
   My reaction to enable sum for timestamps was that it depends on the starting/zero value for the value, in this case the UNIX epoch.
   The result of a addition/sum on timestamp are relatively non-nonsensical as it depends on the choice of the value of zero (1980 + 1980 + 1980 = 2000!?, 1970 + 1970 = 1970!?). When we would change the zero value (to Jan 1 1900 for example), we would totally change the meaning of adding timestamps.
   
   If there was an "actual" zero date (like Kelvin for example) summing the values makes sense but there isn't such a thing for timestamps of course.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString edited a comment on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
returnString edited a comment on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817931084


   Thinking more about this: do we need to be careful about numeric limits? E.g. the current implementation of `AvgAccumulator` in the physical plan tracks the whole sum and count, and does `sum / count` at the end, meaning we might hit `i64::MAX` in the sum component given enough entries in an `avg` input with a sufficiently high-precision timestamp type.
   
   Edit: don't mind me, it forces an f64 accumulator - I saw `ScalarValue` and panicked :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9970: ARROW-12277: [Rust][DataFusion] Implement and test Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9970:
URL: https://github.com/apache/arrow/pull/9970


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] returnString commented on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
returnString commented on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817931084


   Thinking more about this: do we need to be careful about numeric limits? E.g. the current implementation of `AvgAccumulator` in the physical plan tracks the whole sum and count, and does `sum / count` at the end, meaning we might hit `i64::MAX` in the sum component given enough entries in an `avg` input with a sufficiently high-precision timestamp type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9970: ARROW-12277: [Rust][DataFusion] Implement Sum/Count/Min/Max aggregates for Timestamp(_,_)

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9970:
URL: https://github.com/apache/arrow/pull/9970#issuecomment-817317167


   > > What are the exact use cases for summing timestamps? When does it make sense?
   > 
   > @Dandandan that is an excellent question. I will freely admit I was just heads down trying to get all the aggregates working rather than thinking if I _should_ be working. I can't think of any time that `sum(timestamp)` really makes sense to be honest.
   > 
   > Do you think I should remove support for summing timestamps?
   > 
   > As for AVG, I added a note to https://issues.apache.org/jira/browse/ARROW-12318 with your observation. 🤔 good question
   
   At least would be best to remove support for sum, as we don't have an absolute zero for timestamps. For average, I think it may make some more sense, but I think the use case seems limited to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org