Posted to jira@arrow.apache.org by "Kevin Crouse (Jira)" <ji...@apache.org> on 2022/03/25 05:45:00 UTC

[jira] [Comment Edited] (ARROW-16022) floor_temporal / ceil_temporal throws exception for existing timestamps if ambiguous/existing

    [ https://issues.apache.org/jira/browse/ARROW-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512204#comment-17512204 ] 

Kevin Crouse edited comment on ARROW-16022 at 3/25/22, 5:44 AM:
----------------------------------------------------------------

[~rokm] , 

I understand the overhead issue and am fine with it not throwing an exception on creation. My point is more that _if it must throw an exception,_ it should do so on creation and not when calling a function that rounds to the nearest second/millisecond/nanosecond. The value already exists, and whether it is technically valid or not, flooring, ceiling, or rounding it to the second doesn't change its validity.
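
To illustrate why I see this as not touching validity, here is a minimal sketch in plain Python (not pyarrow code). It assumes the value is held as an integer count of ticks since the UTC epoch, which is how I understand timezone-aware Arrow timestamps to be stored; the function and table names are hypothetical. Under that assumption, flooring to the second is integer arithmetic and never consults the time zone:

```python
# Ticks per second for each timestamp unit (names chosen for this sketch).
TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def floor_epoch_to_second(value: int, unit: str) -> int:
    """Floor an epoch-relative tick count to a whole second.

    Python's % returns a non-negative remainder for a positive divisor,
    so values before the epoch are floored (not truncated toward zero).
    """
    ticks = TICKS_PER_SECOND[unit]
    return value - value % ticks


# A millisecond value loses only its sub-second part.
print(floor_epoch_to_second(1_383_458_354_750, "ms"))
# A value that is already in seconds comes back unchanged.
print(floor_epoch_to_second(1_383_458_354, "s"))
```

With a unit of seconds the operation is the identity, which is the case in the issue description.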

I think the PR you referenced would give one an option to call the *_temporal functions without an exception being thrown, which is not possible right now. That's better, but it changes the underlying data in an operation that conceptually should not. Consider the ambiguous time of 1:30am on the day US Eastern Time leaves daylight saving time. I can create a pyarrow array and scalar with this timestamp. When I then call floor_temporal, having only the option (via the referenced PR) for the value to jump to 3:00:00am or drop to 12:59am isn't very good, and it will break comparisons against timestamps that I never had to call *_temporal on. Whatever the behavior, I would propose an invariant: a call to *_temporal with a unit of 'second' should return values within 1 second of the original value.
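
A rough sketch of what I mean, using only the Python standard library (zoneinfo, Python 3.9+) rather than pyarrow. The date 2013-11-03 is the one from the issue description; the helper names are hypothetical and exist only for this example. It shows that 1:30am maps to two distinct UTC instants, and that flooring in UTC and converting back satisfies the proposed invariant for both of them:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

# 01:30 local time happens twice on 2013-11-03; `fold` selects the occurrence.
first = datetime(2013, 11, 3, 1, 30, 0, 250_000, tzinfo=NY, fold=0)   # EDT, UTC-4
second = datetime(2013, 11, 3, 1, 30, 0, 250_000, tzinfo=NY, fold=1)  # EST, UTC-5


def to_utc(dt: datetime) -> datetime:
    """Convert an aware datetime to its unambiguous UTC instant."""
    return dt.astimezone(timezone.utc)


def floor_to_second_via_utc(dt: datetime) -> datetime:
    """Hypothetical helper: floor in UTC, then convert back to dt's zone."""
    floored_utc = to_utc(dt).replace(microsecond=0)
    return floored_utc.astimezone(dt.tzinfo)


def within_one_second(original: datetime, result: datetime) -> bool:
    """The proposed invariant, checked on the UTC instants."""
    return abs(to_utc(original) - to_utc(result)) < timedelta(seconds=1)


for original in (first, second):
    floored = floor_to_second_via_utc(original)
    print(original.isoformat(), "->", floored.isoformat(),
          within_one_second(original, floored))
```

The comparison is done on UTC instants on purpose: two aware datetimes that share the same tzinfo object are compared by wall-clock time, so `first == second` would not distinguish the two occurrences.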

 



> floor_temporal / ceil_temporal throws exception for existing timestamps if ambiguous/existing
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16022
>                 URL: https://issues.apache.org/jira/browse/ARROW-16022
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Kevin Crouse
>            Priority: Major
>
> Running pyarrow.compute.floor_temporal on timestamps that exist throws an exception if the times are ambiguous because of a daylight saving time transition.
> As the *_temporal functions do not fundamentally change the times, it does not make sense for them to fail due to a timezone issue. If they must fail, it should be when the pyarrow.Timestamp is created.
>  
>  
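> {code:python}
> import datetime
>
> import pyarrow
> import pyarrow.compute as pc
> import pytz
>
> ny = pytz.timezone('America/New_York')
> t = pyarrow.timestamp('s', tz='America/New_York')
> # 01:03:14 occurs twice on 2013-11-03, when daylight saving time ends.
> # pytz zones must be attached with localize(); passing one as tzinfo= uses
> # the zone's LMT offset. By default localize() picks the standard-time reading.
> dt = ny.localize(datetime.datetime(2013, 11, 3, 1, 3, 14))
> # if a timestamp must be invalid, this could fail
> za = pyarrow.array([dt], t)
> # raises an exception, even though this is conceptually an identity function here
> pc.floor_temporal(za, unit='second')
> {code}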
>  
> And this actually works just fine (continued from above):
> {code:python}
> pc.cast(
>     pc.floor_temporal(
>         pc.cast(za, pyarrow.timestamp('s', 'UTC')),
>         unit='second',
>     ),
>     pyarrow.timestamp('s', 'America/New_York'),
> )
> {code}
>  
>  
>  
>  


