Posted to jira@arrow.apache.org by "Niclas Roos (Jira)" <ji...@apache.org> on 2020/10/19 07:44:00 UTC

[jira] [Created] (ARROW-10343) Unable to parse strings into timestamps

Niclas Roos created ARROW-10343:
-----------------------------------

             Summary: Unable to parse strings into timestamps
                 Key: ARROW-10343
                 URL: https://issues.apache.org/jira/browse/ARROW-10343
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
         Environment: macOS 10.15.7, Python 3.8.2
            Reporter: Niclas Roos


Hi,

I'm working with Parquet files generated by an AWS RDS Postgres snapshot export.

I'm trying to parse a date column stored as a string into a timestamp, but it fails.

I've managed to parse the same date format (as in the first example below) when reading from a CSV, so I investigated as far as I could on my own. Here are my results:
{code:python}
import pyarrow as pa
import pytz

#################################################################################
## the format I get from the database
us_tz_arr = pa.array([
  "2014-12-07 07:48:59.285332+00",
  "2014-12-07 08:01:49.758975+00",
  "2014-12-07 10:11:35.884304+00"])

us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
# -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00

#################################################################################
## tried removing the timezone
us_arr = pa.array([
  "2014-12-07 07:48:59.285332",
  "2014-12-07 08:01:49.758975",
  "2014-12-07 10:11:35.884304"])

us_arr.cast(pa.timestamp('us'))
# -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304

#################################################################################
## tried removing the microseconds but keeping the timezone
second_tz_arr = pa.array([
  "2014-12-07 07:48:59+00",
  "2014-12-07 08:01:49+00",
  "2014-12-07 10:11:35+00"])

second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
# -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00

#################################################################################
## removing both the microseconds and the timezone makes it work!
s_arr = pa.array([
  "2014-12-07 07:48:59",
  "2014-12-07 08:01:49",
  "2014-12-07 10:11:35"])

s_arr.cast(pa.timestamp('s'))
# -> <pyarrow.lib.TimestampArray object at 0x7fbdf81ae460>
# [
#   2014-12-07 07:48:59,
#   2014-12-07 08:01:49,
#   2014-12-07 10:11:35
# ]
{code}
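
A possible workaround, as a minimal sketch only: parse the strings with the Python standard library before handing them to Arrow. The helper name parse_pg_timestamp and the two regular expressions are hypothetical and not part of pyarrow. The sketch assumes every value has the shape "YYYY-MM-DD HH:MM:SS", optionally followed by fractional seconds and a UTC offset such as "+00".

```python
import re
from datetime import datetime, timezone

# Hypothetical helper, not part of pyarrow.
# Matches a trailing two-digit offset such as "+00" directly after the time.
_SHORT_OFFSET = re.compile(r"(\d{2}:\d{2}:\d{2}(?:\.\d+)?)([+-]\d{2})$")
# Matches a full trailing offset such as "+0000" or "+00:00".
_ANY_OFFSET = re.compile(r"[+-]\d{2}:?\d{2}$")


def parse_pg_timestamp(text):
    """Parse a Postgres-style timestamp string into a datetime.

    Values with an offset are returned as timezone-aware datetimes in UTC;
    values without an offset are returned as naive datetimes.
    """
    # strptime's %z needs minutes in the offset, so "+00" becomes "+0000".
    normalized = _SHORT_OFFSET.sub(r"\g<1>\g<2>00", text.strip())

    # Build the format string from the parts that are actually present.
    fmt = "%Y-%m-%d %H:%M:%S"
    if "." in normalized:
        fmt += ".%f"
    if _ANY_OFFSET.search(normalized):
        fmt += "%z"

    parsed = datetime.strptime(normalized, fmt)
    if parsed.tzinfo is not None:
        parsed = parsed.astimezone(timezone.utc)
    return parsed


values = [
    parse_pg_timestamp(s)
    for s in [
        "2014-12-07 07:48:59.285332+00",
        "2014-12-07 08:01:49.758975+00",
        "2014-12-07 10:11:35.884304+00",
    ]
]
```

The resulting datetime objects could then presumably be passed to pa.array with an explicit timestamp type, for example pa.array(values, type=pa.timestamp('us', tz='UTC')). That last step is an assumption and has not been checked against 1.0.1.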
PS. This is my first bug report, so apologies if anything important is missing.


