Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/09/16 05:21:20 UTC
[jira] [Comment Edited] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

    [ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495408#comment-15495408 ] 

Hyukjin Kwon edited comment on SPARK-17545 at 9/16/16 5:20 AM:
---------------------------------------------------------------

Hi [~nbeyer], the ISO format currently supported follows the W3C note at https://www.w3.org/TR/NOTE-datetime

That note says

{quote}
1997-07-16T19:20:30.45+01:00
{quote}

is a valid ISO format, where the time zone designator is

{quote}
TZD  = time zone designator (Z or +hh:mm or -hh:mm)
{quote}

To make sure, I double-checked the full ISO 8601:2004 specification at http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf

It says:

{quote}
...
the expression shall either be completely in basic format, in which case the minimum number of
separators necessary for the required expression is used, or completely in extended format, in which case
additional separators shall be used
...
{quote}

where the basic format is {{20160707T211822+0300}} whereas the extended format is {{2016-07-07T21:18:22+03:00}}.
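
To make the distinction concrete, here is a minimal sketch in Python that classifies a date-time string as basic, extended, or neither. The two regular expressions and the {{classify}} helper are illustrative assumptions, not part of Spark or of the standard, and they cover only the complete date-time-with-offset shapes discussed here.

```python
import re

# Simplified, assumed patterns: full date, full time, optional fraction, offset.
# Extended format: separators everywhere, including the colon in the offset.
EXTENDED = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})")
# Basic format: no separators in the date, the time, or the offset.
BASIC = re.compile(r"\d{8}T\d{6}(\.\d+)?(Z|[+-]\d{4})")


def classify(text):
    """Return 'extended', 'basic', or 'mixed or invalid' for a date-time string."""
    if EXTENDED.fullmatch(text):
        return "extended"
    if BASIC.fullmatch(text):
        return "basic"
    return "mixed or invalid"


for sample in ("2016-07-07T21:18:22+03:00",
               "20160707T211822+0300",
               "2016-07-07T21:18:22+0300"):
    print(sample, "->", classify(sample))
```

Under these assumptions, a string with an extended date and time but a basic offset matches neither pattern.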

In addition, the basic format seems to be discouraged in plain text:

{quote}
NOTE : The basic format should be avoided in plain text.
{quote}

Therefore, {{2016-07-07T21:18:22+03:00}} is valid ISO 8601:2004, whereas {{2016-07-07T21:18:22+0300}} is not, because the zone designator may not be in the basic format when the date and time of day are in the extended format.
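
For data that already carries the mixed form, one possible approach is to normalize the offset before parsing. The following is only a sketch, assuming Python 3 on the producing or preprocessing side; {{normalize_offset}} is a hypothetical helper, not a Spark API. It also shows how {{strftime}} with {{%z}} produces the offset without a colon.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches an extended time of day followed by a basic +hhmm / -hhmm offset.
_MIXED_OFFSET = re.compile(r"(T\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{2})(\d{2})$")


def normalize_offset(text):
    """Insert the missing colon into a trailing +hhmm / -hhmm offset.

    Strings that already use +hh:mm, or that have no offset, are returned unchanged.
    """
    return _MIXED_OFFSET.sub(r"\1:\2", text)


# strftime's %z emits the offset without a colon.
stamp = datetime(2015, 7, 20, 15, 9, 23, tzinfo=timezone(timedelta(hours=-5)))
raw = stamp.strftime("%Y-%m-%dT%H:%M:%S%z")

print(raw)
print(normalize_offset(raw))
```

The pattern is anchored on the time of day so that a date-only string such as {{2015-07-20}} is left alone.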





> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17545
>                 URL: https://issues.apache.org/jira/browse/SPARK-17545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant of ISO 8601 that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple example of such CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
> 	at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source)
> 	at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source)
> 	at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
> 	at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source)
> 	at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
> 	at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
> 	at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
> 	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior


