You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/21 17:32:12 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #765: Predicates on to_timestamp do not work as expected

alamb opened a new issue #765:
URL: https://github.com/apache/arrow-datafusion/issues/765


   **Describe the bug**
   Given a `TimestampNanosecondArray` which pretty-prints as follows:
   ```
       // +---------------------+
       // | time                |
       // +---------------------+
       // | 2021-07-20 23:28:50 |
       // | 2021-07-20 23:30:30 |
       // +---------------------+
   ```
   Queries involving a predicate such as `time < to_timestamp('2021-07-20 23:29:30')` do not filter any rows (even though they should filter the row with `2021-07-20 23:30:30`)
   
   
   **To Reproduce**
   ```rust
   async fn datafusion_reproducer() {
       let array = TimestampNanosecondArray::from(vec![1626823730000000000, 1626823830000000000]);
       let array: ArrayRef = Arc::new(array);
       println!("array[0]: {:?}", array_value_to_string(&array, 0).unwrap());
       println!("array[1]: {:?}", array_value_to_string(&array, 1).unwrap());
   
       let batch = RecordBatch::try_from_iter(vec![("time", array)]).unwrap();
       let table = MemTable::try_new(batch.schema(), vec![vec![batch]]).unwrap();
       let table = Arc::new(table);
   
       // select * from t
       // +---------------------+
       // | time                |
       // +---------------------+
       // | 2021-07-20 23:28:50 |
       // | 2021-07-20 23:30:30 |
       // +---------------------+
       run_query(table.clone(), "select * from t").await;
   
       // Using the following predicate should result in a single row,
       // but instead results in both
       //
       // select * from t where time < to_timestamp('2021-07-20 23:29:30')
       // +---------------------+
       // | time                |
       // +---------------------+
       // | 2021-07-20 23:28:50 |
       // | 2021-07-20 23:30:30 |
       // +---------------------+
       run_query(table.clone(), "select * from t where time < to_timestamp('2021-07-20 23:29:30')").await;
   
   
   
       // explain select * from t where time < to_timestamp('2021-07-20 23:29:30')
       // +---------------+---------------------------------------------------------------+
       // | plan_type     | plan                                                          |
       // +---------------+---------------------------------------------------------------+
       // | logical_plan  | Projection: #t.time                                           |
       // |               |   Filter: #t.time Lt TimestampNanosecond(1626838170000000000) |
       // |               |     TableScan: t projection=Some([0])                         |
       // | physical_plan | ProjectionExec: expr=[time@0 as time]                         |
       // |               |   CoalesceBatchesExec: target_batch_size=4096                 |
       // |               |     FilterExec: time@0 < 1626838170000000000                  |
       // |               |       RepartitionExec: partitioning=RoundRobinBatch(16)       |
       // |               |         MemoryExec: partitions=1, partition_sizes=[1]         |
       // +---------------+---------------------------------------------------------------+
   
       run_query(table.clone(), "explain select * from t where time < to_timestamp('2021-07-20 23:29:30')").await;
   }
   
   #[allow(dead_code)]
   async fn run_query(csvdata: Arc<dyn TableProvider>, query: &str)  {
   
   
       let mut ctx = ExecutionContext::new();
       ctx.register_table("t", csvdata).unwrap();
   
       let results = ctx.sql(query)
           .unwrap()
           .collect()
           .await
           .unwrap();
   
       let pretty = pretty_format_batches(&results).unwrap();
       println!("{}\n{}", query, pretty);
   }
   ```
   
   **Expected behavior**
   The query should produce a single row with timestamp `2021-07-20 23:28:50`
   
   However the actual query returns both rows
   
   **Additional context**
   Add any other context about the problem here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #765: Predicates on to_timestamp do not work as expected with "naive" timestamp strings

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #765:
URL: https://github.com/apache/arrow-datafusion/issues/765#issuecomment-884429961


   https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884429679 is the proposal for handling timezones more properly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #765: Predicates on to_timestamp do not work as expected with "naive" timestamp strings

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #765:
URL: https://github.com/apache/arrow-datafusion/issues/765#issuecomment-884415600


   The core of the problem is that `TimestampNanosecondArray` is defined to have type `Timestamp(Nanoseconds, None)` -- the second argument of `None` means the following according to the arrow spec [reference to schema.fbs](https://github.com/apache/arrow/blob/master/format/Schema.fbs#L251-L270): 
   
   ```
     /// * If the time zone is null or an empty string, the data is a local date-time
     ///   and does not represent a single moment in time.  Instead it represents a wall clock
     ///   time and care should be taken to avoid interpreting it semantically as an instant.
   ```
   So that certainly suggests we should not be applying any normalization to timestamps if there is no specific timezone set; Instead, we should return the raw "naive" timestamp (which corresponds to the arrow semantics for `Timestamp(_, None)` I think)
   
   Now this leaves open the question of "what do we do if the timestamp has an explicit timezone in it"?  For example, `2021-07-20 23:28:50-05:00`
   
   If the desired output timezone is `UTC` then it makes sense to convert this to UTC 👍 ; However if the desired output timezone is "None" then what?
   
   I feel this is very similar to the question that @velvia  was getting at in https://github.com/apache/arrow-datafusion/issues/686
   
   I will continue the conversation there. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #765: Predicates on to_timestamp do not work as expected

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #765:
URL: https://github.com/apache/arrow-datafusion/issues/765#issuecomment-884385463


   FYI @mcassels @lvheyang and @velvia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org