Posted to issues@spark.apache.org by "Michael Nazario (JIRA)" <ji...@apache.org> on 2015/05/14 00:00:00 UTC
[jira] [Comment Edited] (SPARK-6289) PySpark doesn't maintain SQL date Types
[ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542807#comment-14542807 ]
Michael Nazario edited comment on SPARK-6289 at 5/13/15 9:59 PM:
-----------------------------------------------------------------
This is the problem I have in my tests in Spark 1.3.1.
I've reproduced this problem with a much simpler piece of code in the pyspark shell:
{code}
>>> import pandas, datetime
>>> df = pandas.DataFrame([[datetime.datetime(1990, 1, 1), datetime.date(2000, 3, 3)]], columns=["foo", "bar"])
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf
DataFrame[foo: bigint, bar: date]
>>> row = sdf.first()
>>> row
Row(foo=631152000000000000, bar=datetime.date(2000, 3, 3))
>>> row[1]
datetime.datetime(2000, 3, 3, 0, 0)
>>> row.bar
datetime.date(2000, 3, 3)
{code}
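Until this is fixed, one possible caller-side workaround is to normalize values read by index from a column that the schema reports as date. This is only a minimal sketch: coerce_to_date is a hypothetical helper, not part of PySpark, and it assumes the column really is a date column, since it throws away any time component.

```python
import datetime


def coerce_to_date(value):
    """Return a datetime.date for a value read from a date column.

    Hypothetical helper, not a PySpark API. Only use it on columns the
    schema reports as date, because the time component is discarded.
    """
    # datetime.datetime is a subclass of datetime.date, so it has to be
    # checked explicitly before falling through to the unchanged value.
    if isinstance(value, datetime.datetime):
        return value.date()
    # Already a datetime.date (or None for a null column value).
    return value
```

With the row from the session above, coerce_to_date(row[1]) would then have the same type as row.bar.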
was (Author: mnazario):
I still have the same problem in my tests. This is what I have to reproduce it.
I start up a Spark context and load a Spark DataFrame from an Avro file. The DataFrame has a bunch of simple types. These are the results I get:
{code}
>>> print(df)
DataFrame[a: string, Boolean: boolean, BigDecimal: decimal(10,0), DateTime: timestamp, LocalDate: date, Double: double, Float: float, Integer: int, Long: bigint, String: string]
>>> print(row)
Row(a=u'0', Boolean=True, BigDecimal=Decimal('1.0'), DateTime=datetime.datetime(2000, 1, 1, 3, 31), LocalDate=datetime.date(2000, 2, 2), Double=0.1, Float=0.20000000298023224, Integer=1, Long=2, String=u'foo')
>>> print(row.LocalDate)
2000-02-02
>>> print(type(row.LocalDate))
<type 'datetime.date'>
>>> print(row[4])
2000-02-02 00:00:00
>>> print(type(row[4]))
<type 'datetime.datetime'>
{code}
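The same comparison can be done for every field instead of by hand. This is a rough sketch, assuming only that the row supports both index access and attribute access by field name; find_type_mismatches and its field_names argument are hypothetical names, not part of PySpark.

```python
def find_type_mismatches(row, field_names):
    """Return (name, type by index, type by attribute) for differing fields.

    Hypothetical helper: `row` is any object that supports both row[i]
    and getattr(row, name), and `field_names` lists the names in
    positional order.
    """
    mismatches = []
    for i, name in enumerate(field_names):
        by_index = row[i]
        by_attr = getattr(row, name)
        # Compare exact types: datetime.datetime is a subclass of
        # datetime.date, so an isinstance check would hide the problem.
        if type(by_index) is not type(by_attr):
            mismatches.append((name, type(by_index), type(by_attr)))
    return mismatches
```

For the row above, I would expect this to report only the LocalDate field.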
I've reproduced this problem with a much simpler piece of code in the pyspark shell:
{code}
>>> import pandas, datetime
>>> df = pandas.DataFrame([[datetime.datetime(1990, 1, 1), datetime.date(2000, 3, 3)]], columns=["foo", "bar"])
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf
DataFrame[foo: bigint, bar: date]
>>> row = sdf.first()
>>> row
Row(foo=631152000000000000, bar=datetime.date(2000, 3, 3))
>>> row[1]
datetime.datetime(2000, 3, 3, 0, 0)
>>> row.bar
datetime.date(2000, 3, 3)
{code}
> PySpark doesn't maintain SQL date Types
> ---------------------------------------
>
> Key: SPARK-6289
> URL: https://issues.apache.org/jira/browse/SPARK-6289
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.2.1
> Reporter: Michael Nazario
> Assignee: Davies Liu
>
> For the DateType, Spark SQL requires a datetime.date in Python. However, if you collect a row based on that type, the returned value is of type datetime.datetime.
> I have tried to reproduce this using the pyspark shell, but have been unable to. This is definitely a problem coming from pyrolite though:
> https://github.com/irmen/Pyrolite/
> Pyrolite is used for datetime and date serialization, but it appears to map dates to datetime objects rather than to date objects.