Posted to issues@spark.apache.org by "Yi Zhang (Jira)" <ji...@apache.org> on 2022/12/21 20:11:00 UTC

[jira] [Updated] (SPARK-41650) json expressions much slower in optimized mode

     [ https://issues.apache.org/jira/browse/SPARK-41650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Zhang updated SPARK-41650:
-----------------------------
    Affects Version/s: 3.3.1

> json expressions much slower in optimized mode
> ----------------------------------------------
>
>                 Key: SPARK-41650
>                 URL: https://issues.apache.org/jira/browse/SPARK-41650
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Structured Streaming
>    Affects Versions: 3.1.3, 3.2.2, 3.3.1
>            Reporter: Yi Zhang
>            Priority: Major
>
> I noticed that Spark Structured Streaming jobs that parse JSON strings from Kafka into a struct type are much slower in Spark 3.1+ than in Spark 3.0. Profiling shows that the JSON expressions in Spark 3.0 spend most of their time evaluating subExpr, while in Spark 3.1/3.2 they spend a lot of time in writeField.
> I suspect this may be related to SPARK-32948, so I tried adding a bogus option:
> from_json($"value", mySchema, Map("bogus_key" -> "bogus_value"))
> This turns off the optimization, and the performance is much better. For reference,
> for the same number of records, a task processing 500k records takes 30 seconds vs. 3 minutes. This is a big difference for a streaming job.
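
For anyone who wants to reproduce the comparison, the workaround described above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual job: the schema, the sample rows, and the use of a local batch DataFrame instead of a Kafka stream are all assumptions made for illustration. Only the from_json call with the bogus option comes from the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object FromJsonWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-workaround")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical schema; the report does not show the real one.
    val mySchema = new StructType()
      .add("id", LongType)
      .add("name", StringType)

    // Assumption: a small batch DataFrame stands in for the Kafka "value" column.
    val raw = Seq(
      """{"id": 1, "name": "a"}""",
      """{"id": 2, "name": "b"}"""
    ).toDF("value")

    // Default call, with no options passed to from_json.
    val default = raw
      .select(from_json($"value", mySchema).as("data"))
      .select("data.id", "data.name")

    // Workaround from the report: pass a non-empty options map.
    // The key and value are meaningless; per the report, their presence
    // is what turns the optimization off.
    val workaround = raw
      .select(from_json($"value", mySchema, Map("bogus_key" -> "bogus_value")).as("data"))
      .select("data.id", "data.name")

    // Print both plans so they can be compared side by side.
    default.explain(true)
    workaround.explain(true)

    spark.stop()
  }
}
```

Comparing the two explain outputs on an affected version should show whether the optimizer rewrote the from_json expression in the first query but left it alone in the second. Timing both variants on a realistic record count would be needed to confirm the slowdown itself.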



--
This message was sent by Atlassian Jira
(v8.20.10#820010)