Posted to issues@spark.apache.org by "Rohan Barman (Jira)" <ji...@apache.org> on 2022/10/18 17:23:00 UTC

[jira] [Created] (SPARK-40835) to_utc_timestamp creates null column

Rohan Barman created SPARK-40835:
------------------------------------

             Summary: to_utc_timestamp creates null column
                 Key: SPARK-40835
                 URL: https://issues.apache.org/jira/browse/SPARK-40835
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.0
            Reporter: Rohan Barman


We are in the process of migrating our PySpark applications from Spark version 3.1.2 to Spark version 3.2.0. 

This bug is present in version 3.2.0. We do not see this issue in version 3.1.2.

 

*Minimal example to reproduce bug*

Below is a minimal example. (Note: the original snippet called {{print(df.show())}}; {{show()}} prints the frame itself and returns None, so the wrapper is dropped below.)
{code:python}
# Source data
columns = ["id", "timestamp_field"]
data = [("1", "2022-10-16T00:00:00.000+0000"), ("2", "2022-10-16T00:00:00.000+0000")]
source_df = spark.createDataFrame(data).toDF(*columns)
source_df.createOrReplaceTempView("source")
print("Source:")
source_df.show()

# Execute query
query = """
SELECT
    id,
    timestamp_field AS original,
    to_utc_timestamp(timestamp_field, 'UTC') AS received_timestamp
FROM source
"""
df = spark.sql(query)
print("Transformed:")
df.show()
print(df.count())
{code}

*Post Execution*

The source data has a column, _timestamp_field_, which is of string type.

 
{code:java}
Source:
+---+--------------------+
| id|     timestamp_field|
+---+--------------------+
|  1|2022-10-16T00:00:...|
|  2|2022-10-16T00:00:...|
+---+--------------------+ {code}
The query applies to_utc_timestamp() to _timestamp_field_. The resulting column is entirely null.

 

 
{code:java}
+---+--------------------+------------------+
| id|            original|received_timestamp|
+---+--------------------+------------------+
|  1|2022-10-16T00:00:...|              null|
|  2|2022-10-16T00:00:...|              null|
+---+--------------------+------------------+ {code}
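For reference, the input strings are well-formed ISO-8601-style timestamps with a zero UTC offset ({{+0000}}, no colon); plain Python parses them without trouble, so the data itself is valid (a side check, not a Spark reproduction):

```python
from datetime import datetime, timezone

# The source strings carry a numeric UTC offset without a colon ("+0000"),
# which Python's %z directive accepts:
s = "2022-10-16T00:00:00.000+0000"
ts = datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.utcoffset())  # zero offset: the value is already in UTC
```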

 

*Questions*
 * Did the behavior of to_utc_timestamp change between Spark 3.1.2 and 3.2.0? We do not see this issue in Spark 3.1.2.
 * Is there a Spark setting we can apply to restore the previous behavior?
 * Is there a new preferred function in Spark 3.2.0 that replaces to_utc_timestamp?

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org