Posted to issues@spark.apache.org by "Gengliang Wang (Jira)" <ji...@apache.org> on 2022/05/25 02:51:00 UTC

[jira] [Updated] (SPARK-39193) Fasten Timestamp type inference of default format in JSON/CSV data source

     [ https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-39193:
-----------------------------------
    Summary: Fasten Timestamp type inference of default format in JSON/CSV data source  (was: Improve the performance of inferring Timestamp type in JSON/CSV data source)

> Fasten Timestamp type inference of default format in JSON/CSV data source
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39193
>                 URL: https://issues.apache.org/jira/browse/SPARK-39193
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.3.0
>
>
> When reading JSON/CSV files with timestamp inference enabled (`.option("inferTimestamp", true)`), the Timestamp conversion throws and catches exceptions. Because we put detailed error messages in those exceptions, creating them is not cheap: it consumes more than 90% of the type inference time.
> We can instead use parsing methods that return optional results.
> Before the change, it takes 166 seconds to infer the schema of a 624 MB JSON file with timestamp inference enabled.
> After the change, it takes only 16 seconds.
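
The difference between the two parsing styles described above can be sketched with plain java.time. This is only an illustration, not the actual Spark patch: the class name `TimestampInferenceSketch`, both method names, and the choice of `DateTimeFormatter.ISO_LOCAL_DATE_TIME` are assumptions made for this example. The first method builds a `DateTimeParseException` for every string that is not a timestamp; the second reports failure through a `ParsePosition` and returns an empty `Optional`, so no exception or error message is constructed.

```java
import java.text.ParsePosition;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.Optional;

// Hypothetical sketch; names and formatter are assumptions, not Spark code.
class TimestampInferenceSketch {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ISO_LOCAL_DATE_TIME;

    // Exception-driven check: each failed parse allocates an exception,
    // formats its message, and fills in a stack trace.
    static boolean looksLikeTimestampThrowing(String s) {
        try {
            FORMAT.parse(s);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Optional-returning check: parseUnresolved signals malformed text by
    // returning null and setting the error index instead of throwing.
    // Caveat: it skips field resolution, so an impossible date such as
    // February 30 is not rejected here and would need a later validation step.
    static Optional<TemporalAccessor> parseOptional(String s) {
        ParsePosition pos = new ParsePosition(0);
        TemporalAccessor parsed = FORMAT.parseUnresolved(s, pos);
        boolean consumedAll = pos.getIndex() == s.length();
        if (parsed == null || pos.getErrorIndex() >= 0 || !consumedAll) {
            return Optional.empty();
        }
        return Optional.of(parsed);
    }

    public static void main(String[] args) {
        String[] samples = {"2022-05-25T02:51:00", "not a timestamp", "2022-05-25T02:51:00 UTC"};
        for (String s : samples) {
            System.out.println(s + " | throwing: " + looksLikeTimestampThrowing(s)
                + " | optional: " + parseOptional(s).isPresent());
        }
    }
}
```

During inference most string values in a typical file are not timestamps, so the failure path is the common one; that is why avoiding exception construction there can matter as much as the numbers above suggest.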



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
