You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/08 18:53:02 UTC

[GitHub] [spark] xkrogen commented on pull request #35756: [SPARK-38437][SQL] Lenient serialization of datetime from datasource

xkrogen commented on pull request #35756:
URL: https://github.com/apache/spark/pull/35756#issuecomment-1062097418


   Perhaps I should have been more careful with my wording. I agree that no **checks** are weakened. However, we previously generated code that was specialized to a single type, either Java native or SQL type. Now, we perform a type-check at runtime, meaning we have to do an `instanceof` call for _every single record_. My understanding is that `instanceof` is pretty fast on modern JVMs, but still, it is additional overhead on a per-record basis, so there is a performance degradation.
   
   I think the PR looks much more reasonable now that it is scoped to only the Datasource API, and I do understand the problem you're addressing, though it still imposes extra overhead for well-behaved datasources that obey the `java8API.enabled` config. I am wondering if this behavior should be configurable? That would enable performance-sensitive users to generate code which strictly obeys the `java8API.enabled` config, and users who need to accommodate legacy datasources (which don't obey the config) can accept the small performance hit. We would just need to change `true` in this line to a configurable value:
   ```scala
   val toRow = RowEncoder(StructType.fromAttributes(output), lenient = true).createSerializer()
   ```
   
   Not a strong concern, I think it is likely that the performance difference is small, but something to consider.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org