Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/10/06 06:47:00 UTC
[jira] [Updated] (SPARK-36934) Timestamps are written as array bytes.
[ https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-36934:
---------------------------------
Description:
This was tested with a master build from 2021-10-04:
{code}
df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
                   'month': [2, 3],
                   'day': [4, 5],
                   'test': [1, 2]})
df["year"] = ps.to_datetime(df["year"])
df.info()
<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    2 non-null      datetime64
 1   month   2 non-null      int64
 2   day     2 non-null      int64
 3   test    2 non-null      int64
dtypes: datetime64(1), int64(3)

spark_df_date = df.to_spark()
spark_df_date.printSchema()
root
 |-- year: timestamp (nullable = true)
 |-- month: long (nullable = false)
 |-- day: long (nullable = false)
 |-- test: long (nullable = false)

spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")
{code}
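A side note on a possible cause (my reading, not confirmed): Spark writes timestamps to Parquet as the legacy INT96 type by default, controlled by the `spark.sql.parquet.outputTimestampType` configuration. A sketch of a workaround to try, assuming a live Spark session named `spark` and the DataFrame from above:

```python
# Config fragment, not runnable without a Spark session: write timestamps
# using the standard TIMESTAMP_MICROS annotation instead of legacy INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")
```

On the Drill side, enabling the session option `store.parquet.reader.int96_as_timestamp` is another thing worth trying, since it tells Drill's Parquet reader to interpret INT96 values as timestamps.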
Load the files into Apache Drill (I use the Docker image apache/drill:master-openjdk-14):
{code}
SELECT * FROM cp.`/data/spark_df_date.*`
{code}
It prints the year column as:
{code}
\x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
\x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00
{code}
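For what it's worth, those byte strings look exactly like Parquet's legacy INT96 timestamp encoding: 8 little-endian bytes of nanoseconds-within-day followed by 4 little-endian bytes of Julian day number. A small sketch (my own decoder, not code from Spark or Drill) that turns the two printed values back into the input dates:

```python
import struct
from datetime import date, datetime, timedelta, timezone

JDN_1970_01_01 = 2440588  # Julian day number of the Unix epoch

def decode_int96(raw: bytes) -> datetime:
    # 8 bytes nanoseconds-within-day + 4 bytes Julian day, both little-endian
    nanos, julian_day = struct.unpack("<qi", raw)
    day = date(1970, 1, 1) + timedelta(days=julian_day - JDN_1970_01_01)
    return (datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
            + timedelta(microseconds=nanos // 1000))

# The two values Drill printed, written out as raw bytes
print(decode_int96(b"\x00\x00\x00\x00\x00\x00\x00\x00\xe2\x7d\x25\x00"))
print(decode_int96(b"\x00\x00\x00\x00\x00\x00\x00\x00\x6d\x7f\x25\x00"))
# → 2015-02-04 00:00:00+00:00 and 2016-03-05 00:00:00+00:00
```

These decode to exactly the '2015-2-4' and '2016-3-5' values from the DataFrame, so the data in the Parquet file appears intact and the symptom looks like a reader-side interpretation issue.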
The rest of the columns are fine.
So is this a Spark problem or an Apache Drill problem?
> Timestamps are written as array bytes.
> --------------------------------------
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Bjørn Jørgensen
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org