Posted to issues@spark.apache.org by "Rohan Barman (Jira)" <ji...@apache.org> on 2022/10/18 19:57:00 UTC

[jira] [Comment Edited] (SPARK-40835) to_utc_timestamp creates null column

    [ https://issues.apache.org/jira/browse/SPARK-40835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619761#comment-17619761 ] 

Rohan Barman edited comment on SPARK-40835 at 10/18/22 7:56 PM:
----------------------------------------------------------------

I found a similar issue, https://issues.apache.org/jira/browse/SPARK-37067, which was resolved in Spark 3.2.1 according to the release notes.

Does this mean the only way to fix this issue is to move up to Spark 3.2.1? Is there no solution in Spark 3.2.0?
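
In the meantime, a possible workaround on 3.2.0 (untested on our side) might be to parse the string explicitly instead of relying on the implicit cast, e.g. to_timestamp(timestamp_field, "yyyy-MM-dd'T'HH:mm:ssZ"), or to set spark.sql.legacy.timeParserPolicy=LEGACY. As a sanity check outside Spark, the equivalent Python strptime pattern does parse our sample strings:

```python
from datetime import datetime, timezone

# Sample strings copied from the repro below.
samples = ["2022-10-17T00:00:00+0000", "2022-10-17T00:00:00+0000"]

# "%Y-%m-%dT%H:%M:%S%z" is the strptime analogue of the Java-style
# pattern "yyyy-MM-dd'T'HH:mm:ssZ" one would pass to to_timestamp().
parsed = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S%z") for s in samples]

for dt in parsed:
    # The offset is +0000, so converting to UTC leaves the instant unchanged.
    print(dt.astimezone(timezone.utc).isoformat())
```

So the strings themselves are well-formed; the question is only which pattern Spark 3.2.0's parser accepts for the implicit string-to-timestamp cast.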


was (Author: JIRAUSER296654):
I found a similar issue, https://issues.apache.org/jira/browse/SPARK-37067, which was resolved in the Spark 3.2.1 release notes.

Does this mean the only way to fix this issue is to move up to Spark 3.2.1? Is there no solution in Spark 3.2.0?

> to_utc_timestamp creates null column
> ------------------------------------
>
>                 Key: SPARK-40835
>                 URL: https://issues.apache.org/jira/browse/SPARK-40835
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Rohan Barman
>            Priority: Major
>
> We are in the process of migrating our PySpark applications from Spark version 3.1.2 to Spark version 3.2.0. 
> This bug is present in version 3.2.0. We do not see this issue in version 3.1.2.
>  
> *Minimal example to reproduce bug*
> Below is a minimal example of applying to_utc_timestamp() to a string column containing timestamp data:
> {code:java}
> # Source data (no imports needed; the transformation runs via spark.sql)
> columns = ["id", "timestamp_field"]
> data = [("1", "2022-10-17T00:00:00+0000"), ("2", "2022-10-17T00:00:00+0000")]
> source_df = spark.createDataFrame(data).toDF(*columns)
> source_df.createOrReplaceTempView("source")
> print("Source:")
> source_df.show()
> # Execute query
> query = """
> SELECT
>     id,
>     timestamp_field AS original,
>     to_utc_timestamp(timestamp_field, 'UTC') AS received_timestamp
> FROM source
> """
> df = spark.sql(query)
> print("Transformed:")
> df.show()
> print(df.count()) {code}
> *Post Execution*
> The source data has a column called _timestamp_field_, which is a string.
> {code:java}
> Source:
> +---+--------------------+                                                      
> | id|     timestamp_field|
> +---+--------------------+
> |  1|2022-10-17T00:00:...|
> |  2|2022-10-17T00:00:...|
> +---+--------------------+
> {code}
> The query applies to_utc_timestamp() to timestamp_field to create a new column, but every value in the new column is null.
> {code:java}
> Transformed:
> +---+--------------------+------------------+
> | id|            original|received_timestamp|
> +---+--------------------+------------------+
> |  1|2022-10-16T00:00:...|              null|
> |  2|2022-10-16T00:00:...|              null|
> +---+--------------------+------------------+ {code}
>  
> *Questions*
>  * Did the to_utc_timestamp function change in Spark 3.2.0? We don't see this issue in Spark 3.1.2.
>  * Can we apply any Spark settings to resolve this?
>  * Is there a new preferred function in Spark 3.2.0 that replaces to_utc_timestamp?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org