Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/16 08:29:13 UTC

[GitHub] [spark] gengliangwang opened a new pull request, #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

gengliangwang opened a new pull request, #36562:
URL: https://github.com/apache/spark/pull/36562

### What changes were proposed in this pull request?

When reading JSON/CSV files with timestamp type inference enabled (`.option("inferTimestamp", true)`), the timestamp conversion throws and catches exceptions for values that fail to parse.
Because we build detailed error messages in the exception:
```
def cannotCastToDateTimeError(
    value: Any, from: DataType, to: DataType, errorContext: String): Throwable = {
  val valueString = toSQLValue(value, from)
  new SparkDateTimeException("INVALID_SYNTAX_FOR_CAST",
    Array(toSQLType(to), valueString, SQLConf.ANSI_ENABLED.key, errorContext))
}
```
creating the exceptions is not cheap: it accounts for more than 90% of the type inference time.

We can instead use parsing methods that return optional results, which avoids creating the exceptions. With this PR, the schema inference time is reduced by 90% in a local benchmark.
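As a rough illustration of the idea (not the code in this PR), the sketch below contrasts an exception-based parse with one that returns an optional result, using only `java.time`. The object name `OptionalParseSketch`, both method names, and the choice of `ISO_LOCAL_DATE_TIME` are assumptions made for the example. `DateTimeFormatter.parseUnresolved` reports a syntax failure by returning `null` instead of throwing, so no exception object is built for non-timestamp input.

```scala
import java.text.ParsePosition
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import java.time.temporal.TemporalAccessor

// Hypothetical names; a sketch of the technique, not Spark's implementation.
object OptionalParseSketch {
  private val formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME

  // Exception-based: every string that is not a timestamp pays for
  // constructing a DateTimeParseException that is immediately discarded.
  def parseWithException(s: String): Option[LocalDateTime] =
    try Some(LocalDateTime.parse(s, formatter))
    catch { case _: DateTimeParseException => None }

  // Optional-result: parseUnresolved signals a syntax error by returning null
  // and recording the error index in the ParsePosition, without throwing.
  // The result is left unresolved, so this is a syntax check only.
  def parseOptional(s: String): Option[TemporalAccessor] = {
    val pos = new ParsePosition(0)
    val parsed = formatter.parseUnresolved(s, pos)
    // Accept only if parsing succeeded and consumed the whole input.
    if (parsed != null && pos.getIndex == s.length) Some(parsed) else None
  }
}
```

For type inference, the caller only needs to know whether a value looks like a timestamp, so `parseOptional(...).isDefined` is enough; converting the unresolved result into an actual timestamp value would be a separate step.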

### Why are the changes needed?

Performance improvement

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT
Also manually tested the schema inference time on a 624 MB JSON file with timestamp inference enabled:
```
spark.read.option("inferTimestamp", true).json(file)
```

- Before the change, it takes 166 seconds.
- After the change, it takes only 16 seconds.
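
The description does not say how the timings above were collected. One possible way to take such a measurement is a small wall-clock helper like the sketch below; `TimingSketch` and `timed` are hypothetical names, and a single run like this is only a coarse measurement (no warm-up, no repetition).

```scala
// Hypothetical helper; not part of Spark or of this PR.
object TimingSketch {
  // Evaluates `block` once and returns its result together with the
  // elapsed wall-clock time in seconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedSeconds = (System.nanoTime() - start) / 1e9
    (result, elapsedSeconds)
  }
}
```

In a Spark shell, assuming `file` points at the JSON input, this could be used as `val (df, secs) = TimingSketch.timed(spark.read.option("inferTimestamp", true).json(file))`.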

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

For additional commands, e-mail: reviews-help@spark.apache.org