Posted to issues@spark.apache.org by "Michael Nazario (JIRA)" <ji...@apache.org> on 2015/05/14 00:00:00 UTC

[jira] [Comment Edited] (SPARK-6289) PySpark doesn't maintain SQL date Types

    [ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542807#comment-14542807 ] 

Michael Nazario edited comment on SPARK-6289 at 5/13/15 9:59 PM:
-----------------------------------------------------------------

This is the problem I am still seeing in my tests on Spark 1.3.1.

I've reproduced this problem with a much simpler piece of code in the pyspark shell:

{code}
>>> import pandas, datetime
>>> df = pandas.DataFrame([[datetime.datetime(1990, 1, 1), datetime.date(2000, 3, 3)]], columns=["foo", "bar"])
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf
DataFrame[foo: bigint, bar: date]
>>> row = sdf.first()
>>> row
Row(foo=631152000000000000, bar=datetime.date(2000, 3, 3))
>>> row[1]
datetime.datetime(2000, 3, 3, 0, 0)
>>> row.bar
datetime.date(2000, 3, 3)
{code}
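
As a stopgap (my own sketch, not an official fix), the collected values can be coerced back to dates after the fact, since only positional access returns the wrong type:

{code}
import datetime

def as_date(value):
    # Positional access on a Row can hand back datetime.datetime for a
    # DateType column, so coerce any datetime back down to a date.
    if isinstance(value, datetime.datetime):
        return value.date()
    return value

# Applied only to the DateType column ("bar", index 1) of sdf above;
# blindly coercing every column would truncate real timestamps.
rows = [(r[0], as_date(r[1])) for r in sdf.collect()]
{code}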


was (Author: mnazario):
I still have the same problem in my tests. Here is what I did to reproduce it.

I start up a Spark context and get a Spark DataFrame from an Avro file. The DataFrame has a bunch of simple types.
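The load looks roughly like this (the path and the Avro data source name are placeholders from my setup, not part of the original transcript):

{code}
# Hypothetical load path; in Spark 1.3 this goes through SQLContext.load.
df = sqlCtx.load("/path/to/data.avro", source="com.databricks.spark.avro")
row = df.first()
{code}

These are the results I get: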
{code}
>>> print(df)
DataFrame[a: string, Boolean: boolean, BigDecimal: decimal(10,0), DateTime: timestamp, LocalDate: date, Double: double, Float: float, Integer: int, Long: bigint, String: string]
>>> print(row)
Row(a=u'0', Boolean=True, BigDecimal=Decimal('1.0'), DateTime=datetime.datetime(2000, 1, 1, 3, 31), LocalDate=datetime.date(2000, 2, 2), Double=0.1, Float=0.20000000298023224, Integer=1, Long=2, String=u'foo')
>>> print(row.LocalDate)
2000-02-02
>>> print(type(row.LocalDate))
<type 'datetime.date'>
>>> print(row[4])
2000-02-02 00:00:00
>>> print(type(row[4]))
<type 'datetime.datetime'>
{code}
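
The two access paths visibly disagree on the same field; a direct check (my own, against the row above) makes it explicit:

{code}
>>> type(row.LocalDate) is type(row[4])
False
{code}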

I've reproduced this problem with a much simpler piece of code in the pyspark shell:

{code}
>>> import pandas, datetime
>>> df = pandas.DataFrame([[datetime.datetime(1990, 1, 1), datetime.date(2000, 3, 3)]], columns=["foo", "bar"])
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf
DataFrame[foo: bigint, bar: date]
>>> row = sdf.first()
>>> row
Row(foo=631152000000000000, bar=datetime.date(2000, 3, 3))
>>> row[1]
datetime.datetime(2000, 3, 3, 0, 0)
>>> row.bar
datetime.date(2000, 3, 3)
{code}

> PySpark doesn't maintain SQL date Types
> ---------------------------------------
>
>                 Key: SPARK-6289
>                 URL: https://issues.apache.org/jira/browse/SPARK-6289
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.2.1
>            Reporter: Michael Nazario
>            Assignee: Davies Liu
>
> For the DateType, Spark SQL requires a datetime.date in Python. However, if you collect a row containing that type, you'll end up with a returned value of type datetime.datetime.
> I have tried to reproduce this using the pyspark shell, but have been unable to. This is definitely a problem coming from Pyrolite, though:
> https://github.com/irmen/Pyrolite/
> Pyrolite is used for datetime and date serialization, but it appears to map date values to datetime objects rather than date objects.
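>
> For contrast, a quick sanity check (a sketch, not from the original report) shows that plain CPython pickling preserves the date/datetime distinction, which points at the JVM-side mapping rather than Python's pickle module:
> {code}
> import pickle, datetime
>
> d = datetime.date(2000, 3, 3)
> dt = datetime.datetime(2000, 3, 3)
>
> # CPython round-trips both types faithfully; the conflation happens in
> # Pyrolite's Java-side handling of the pickled date.
> assert type(pickle.loads(pickle.dumps(d))) is datetime.date
> assert type(pickle.loads(pickle.dumps(dt))) is datetime.datetime
> {code}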



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org