Posted to issues@arrow.apache.org by "Shannon C Lewis (Jira)" <ji...@apache.org> on 2019/09/13 18:50:00 UTC

[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

    [ https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929452#comment-16929452 ] 

Shannon C Lewis commented on ARROW-3543:
----------------------------------------

I ran a simple test reading a delimited file in R, and I'm seeing the timestamps change:

system: Red Hat Enterprise Linux Server release 7.6 (Maipo)
$ rpm -qa |grep arrow
arrow-glib-libs-0.14.1-1.el7.x86_64
arrow-glib-devel-0.14.1-1.el7.x86_64
arrow-devel-0.14.1-1.el7.x86_64
arrow-libs-0.14.1-1.el7.x86_64
$ rpm -qa |grep parq
parquet-glib-libs-0.14.1-1.el7.x86_64
parquet-devel-0.14.1-1.el7.x86_64
parquet-libs-0.14.1-1.el7.x86_64
parquet-glib-devel-0.14.1-1.el7.x86_64



> packageVersion("arrow")
[1] ‘0.14.1.1’
> library(arrow)
> Sys.timezone()
[1] "America/Los_Angeles"
> readLines("test.log")
 [1] "DateTime|LogLevel|Type|FunctionName|Message|UserName"
 [2] "2019-09-11 21:36:22|[INFO]|CentralLogger| |CentralLogger Initialized|shannon.lewis"
 [3] "2019-09-11 21:36:22|[INFO]|Controller| |Controller.R Initialized|shannon.lewis"
 [4] "2019-09-11 22:43:58|[INFO]|studyId1|archiveStudy|Start of archiveStudy : Intialized|shannon.lewis"
 [5] "2019-09-11 22:43:58|[INFO]|studyId1|archiveStudy|archiveStudy All done, quitting.|shannon.lewis"
 [6] "2019-09-11 23:11:39|[INFO]|central|consolidateStudyLogs|Start of consolidateStudyLogs : Initialized|shannon.lewis"
 [7] "2019-09-12 00:36:22|[INFO]|CentralLogger| |CentralLogger Initialized|shannon.lewis"
 [8] "2019-09-12 00:36:22|[INFO]|Controller| |Controller.R Initialized|shannon.lewis"
 [9] "2019-09-12 00:43:58|[INFO]|studyId1|archiveStudy|Start of archiveStudy : Intialized|shannon.lewis"
[10] "2019-09-12 00:43:58|[INFO]|studyId1|archiveStudy|archiveStudy All done, quitting.|shannon.lewis"
[11] "2019-09-12 01:11:39|[INFO]|central|consolidateStudyLogs|Start of consolidateStudyLogs : Initialized|shannon.lewis"
Warning message:
In readLines("test.log") : incomplete final line found on 'test.log'
> arrow_log <- read_delim_arrow("test.log", delim = "|")
> str(arrow_log)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	10 obs. of  6 variables:
 $ DateTime    : POSIXct, format: "2019-09-11 14:36:22" "2019-09-11 14:36:22" "2019-09-11 15:43:58" "2019-09-11 15:43:58" ...
 $ LogLevel    : chr  "[INFO]" "[INFO]" "[INFO]" "[INFO]" ...
 $ Type        : chr  "CentralLogger" "Controller" "studyId1" "studyId1" ...
 $ FunctionName: chr  " " " " "archiveStudy" "archiveStudy" ...
 $ Message     : chr  "CentralLogger Initialized" "Controller.R Initialized" "Start of archiveStudy : Intialized" "archiveStudy All done, quitting." ...
 $ UserName    : chr  "shannon.lewis" "shannon.lewis" "shannon.lewis" "shannon.lewis" ...
> arrow_log
# A tibble: 10 x 6
   DateTime            LogLevel Type          FunctionName         Message                                     UserName
   <dttm>              <chr>    <chr>         <chr>                <chr>                                       <chr>
 1 2019-09-11 14:36:22 [INFO]   CentralLogger " "                  CentralLogger Initialized                   shannon.lewis
 2 2019-09-11 14:36:22 [INFO]   Controller    " "                  Controller.R Initialized                    shannon.lewis
 3 2019-09-11 15:43:58 [INFO]   studyId1      archiveStudy         Start of archiveStudy : Intialized          shannon.lewis
 4 2019-09-11 15:43:58 [INFO]   studyId1      archiveStudy         archiveStudy All done, quitting.            shannon.lewis
 5 2019-09-11 16:11:39 [INFO]   central       consolidateStudyLogs Start of consolidateStudyLogs : Initialized shannon.lewis
 6 2019-09-11 17:36:22 [INFO]   CentralLogger " "                  CentralLogger Initialized                   shannon.lewis
 7 2019-09-11 17:36:22 [INFO]   Controller    " "                  Controller.R Initialized                    shannon.lewis
 8 2019-09-11 17:43:58 [INFO]   studyId1      archiveStudy         Start of archiveStudy : Intialized          shannon.lewis
 9 2019-09-11 17:43:58 [INFO]   studyId1      archiveStudy         archiveStudy All done, quitting.            shannon.lewis
10 2019-09-11 18:11:39 [INFO]   central       consolidateStudyLogs Start of consolidateStudyLogs : Initialized shannon.lewis
> base_log <- read.delim("test.log", sep = "|")
> str(base_log)
'data.frame':	10 obs. of  6 variables:
 $ DateTime    : Factor w/ 6 levels "2019-09-11 21:36:22",..: 1 1 2 2 3 4 4 5 5 6
 $ LogLevel    : Factor w/ 1 level "[INFO]": 1 1 1 1 1 1 1 1 1 1
 $ Type        : Factor w/ 4 levels "central","CentralLogger",..: 2 3 4 4 1 2 3 4 4 1
 $ FunctionName: Factor w/ 3 levels " ","archiveStudy",..: 1 1 2 2 3 1 1 2 2 3
 $ Message     : Factor w/ 5 levels "archiveStudy All done, quitting.",..: 2 3 4 1 5 2 3 4 1 5
 $ UserName    : Factor w/ 1 level "shannon.lewis": 1 1 1 1 1 1 1 1 1 1
> base_log
              DateTime LogLevel          Type         FunctionName                                     Message      UserName
1  2019-09-11 21:36:22   [INFO] CentralLogger                                        CentralLogger Initialized shannon.lewis
2  2019-09-11 21:36:22   [INFO]    Controller                                         Controller.R Initialized shannon.lewis
3  2019-09-11 22:43:58   [INFO]      studyId1         archiveStudy          Start of archiveStudy : Intialized shannon.lewis
4  2019-09-11 22:43:58   [INFO]      studyId1         archiveStudy            archiveStudy All done, quitting. shannon.lewis
5  2019-09-11 23:11:39   [INFO]       central consolidateStudyLogs Start of consolidateStudyLogs : Initialized shannon.lewis
6  2019-09-12 00:36:22   [INFO] CentralLogger                                        CentralLogger Initialized shannon.lewis
7  2019-09-12 00:36:22   [INFO]    Controller                                         Controller.R Initialized shannon.lewis
8  2019-09-12 00:43:58   [INFO]      studyId1         archiveStudy          Start of archiveStudy : Intialized shannon.lewis
9  2019-09-12 00:43:58   [INFO]      studyId1         archiveStudy            archiveStudy All done, quitting. shannon.lewis
10 2019-09-12 01:11:39   [INFO]       central consolidateStudyLogs Start of consolidateStudyLogs : Initialized shannon.lewis
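
A note on the numbers above: every DateTime value returned by read_delim_arrow() is exactly 7 hours earlier than the text in test.log (21:36:22 becomes 14:36:22), while read.delim() leaves the text unchanged. That offset matches what one would see if the zone-less timestamps were taken as UTC and then printed in the session time zone, America/Los_Angeles, which is UTC-7 in September. Whether the reader actually works that way is an assumption here, not something the output confirms. The sketch below only reproduces the arithmetic in Python with the standard library; the helper name and the fixed UTC-7 offset are illustrative stand-ins.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for America/Los_Angeles,
# which observes PDT (UTC-7) on the September dates in test.log.
PDT = timezone(timedelta(hours=-7), "PDT")

FORMAT = "%Y-%m-%d %H:%M:%S"


def read_as_utc_then_show_local(text, local=PDT):
    """Treat a zone-less timestamp string as UTC, then render it in `local`.

    Hypothetical helper: it models the suspected behaviour, it is not
    taken from the arrow package.
    """
    naive = datetime.strptime(text, FORMAT)
    as_utc = naive.replace(tzinfo=timezone.utc)  # attach UTC, keep the clock value
    return as_utc.astimezone(local).strftime(FORMAT)  # shift to local wall clock


# Rows 1, 5 and 10 of test.log.
file_values = [
    "2019-09-11 21:36:22",
    "2019-09-11 23:11:39",
    "2019-09-12 01:11:39",
]
shown = [read_as_utc_then_show_local(value) for value in file_values]

for before, after in zip(file_values, shown):
    print(before, "->", after)
```

The three inputs are rows 1, 5 and 10 of test.log, and the values printed by the sketch agree with the same rows of arrow_log, including the last row moving from 2019-09-12 back to 2019-09-11.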

> [R] Time zone adjustment issue when reading Feather file written by Python
> --------------------------------------------------------------------------
>
>                 Key: ARROW-3543
>                 URL: https://issues.apache.org/jira/browse/ARROW-3543
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Olaf
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 14:01:02.200')]}
> )
> df['timestamp_est'] = pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
> Out[17]:
>           string_time_utc           timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R assumes `UTC` by default and wrongly attaches that time zone to `timestamp_est`. No big deal: I can always use `with_tz` or, even better, import the column as character and parse it as a timestamp in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 4
>      X1 string_time_utc         timestamp_est           mytimezone
>   <int> <dttm>                  <dttm>                  <chr>
> 1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
> 2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
> 3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
> {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> # A tibble: 3 x 3
>   string_time_utc         timestamp_est           mytimezone
>   <dttm>                  <dttm>                  <chr>
> 1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
> 2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
> 3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
> {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!
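
In the quoted report, both columns read back from Feather are 5 hours earlier than what pandas wrote, and the time zone attribute is empty. That pattern would fit zone-less values being treated as UTC instants and printed in the reader's local zone, if that zone were US/Eastern. The report does not state the R session's time zone, so this is an assumption. The sketch below, in Python with the standard library only, mirrors the pandas chain tz_localize('UTC') / tz_convert('US/Eastern') / tz_localize(None) and then applies the same shift a second time to show how the reported values could arise. The helper names and the fixed UTC-5 offset are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for US/Eastern. All sample dates
# in the report fall before the March 2018 DST change, so EST applies.
EST = timezone(timedelta(hours=-5), "EST")


def to_naive_wall_clock(naive_utc, zone=EST):
    """Read a zone-less datetime as UTC, convert to `zone`, drop the zone.

    Hypothetical helper mirroring
    tz_localize('UTC').tz_convert(zone).tz_localize(None) in pandas.
    """
    aware = naive_utc.replace(tzinfo=timezone.utc)
    return aware.astimezone(zone).replace(tzinfo=None)


def fmt(value):
    """Format with millisecond precision, as in the report's tables."""
    return value.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


string_time_utc = datetime(2018, 2, 1, 14, 0, 0, 531000)

# Intended conversion done in pandas before writing the file.
timestamp_est = to_naive_wall_clock(string_time_utc)

# Suspected unintended second shift when the zone-less value is read back.
shown_after_read = to_naive_wall_clock(timestamp_est)

print("written :", fmt(string_time_utc), fmt(timestamp_est))
print("read    :", fmt(to_naive_wall_clock(string_time_utc)), fmt(shown_after_read))
```

Under these assumptions the "read" line reproduces the first row of the Feather output in the report (09:00:00.531 and 04:00:00.531), which suggests the wall-clock values were shifted once more on display rather than corrupted in the file.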


