You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/19 07:27:04 UTC

[GitHub] [arrow] jorisvandenbossche commented on pull request #10933: ARROW-13549: [C++] Add date/time extraction functions

jorisvandenbossche commented on pull request #10933:
URL: https://github.com/apache/arrow/pull/10933#issuecomment-901676354


   Personally, I think having a separate (non-cast) kernel to extract those components make sense from a user perspective (but can of course share implementation), and complementing the other timestamp component extraction kernels we already have. 
   As @lidavidm also mentions, this is in general "unsafe" cast, but when you explicitly want to extract a time/date component, it is obvious you want this and having to allow an unsafe cast feels unnecessarily.
   
   > There's already a cast from timestamp to date32/date64, however, placing this implementation there would change semantics a little bit:
   
   I would say that the current casting semantics are wrong and should be fixed? (in any case, the "extracted" date is clearly wrong, whether that's seen as a consequence of the "unsafe" cast or not is to be discussed I suppose)
   
   There are actually two "unsafe" steps in this conversion it seems (which explains the wrong part):
   
   ```python
   arr = pa.array(["1970-01-01 00:00:59.123456789","2000-02-29 23:23:23.999999999","1899-01-01 00:59:20.001001001"]).cast(pa.timestamp("ns"))
   
   >>> arr.cast(pa.date64())
   ...
   ArrowInvalid: Casting from timestamp[ns] to date64[ms] would lose data: 59123456789
   ../src/arrow/compute/kernels/scalar_cast_temporal.cc:178  (ShiftTime<int64_t, int64_t>(ctx, conversion.first, conversion.second, input, output))
   
   # that error is actually coming from a conversion to milliseconds
   # (and you don't really care about the part being lost for conversion to date ..)
   >>> arr.cast(pa.timestamp("ms"))
   ...
   ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 59123456789
   
   # when ignoring this lost part in conversion to ms, then casting to date gives another error:
   >>> arr.cast(pa.timestamp("ms"), safe=False).cast(pa.date64())
   ...
   ArrowInvalid: Timestamp value had non-zero intraday milliseconds
   ```
   
   And I _suppose_ that when those intraday milliseconds are ignored by doing `safe=False`, we do a simple round to get rid of those:
   
   https://github.com/apache/arrow/blob/d3af6d479bd3f6966b793cc9a1cf5bafe0cd032e/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L198-L202
   
   By subtracting the remainder we basically round towards zero (it seems C++ module operator behaves differently as Python when involving negative integers `-12 % 10 = -2` in C++ and `-12 % 10 = 8` in Python), which means that for negative values we are rounding up, and we should round down instead? (that could be seen as a bug fix?)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org