Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/15 00:00:22 UTC

[GitHub] [spark] xkrogen commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

xkrogen commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1314567023

   One point that I'd be interested in discussing is the handling of untrusted input data sources (not UDFs). For some context, in our environment this situation mostly arises because we have a DSv2 source which tracks the schema for a table in a catalog (including nullability information) as well as a pointer to an HDFS location. At times, due to erroneous pipelines, the schema can declare a field as non-null even though the underlying files contain null values for it. Currently, diagnosing such issues and determining where the mismatched input lives is very challenging.
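   
   To make the pain concrete, this is roughly the one-off diagnostic we end up writing by hand today to locate the offending input (a minimal sketch; the path `hdfs:///data/events` and the column `userId` are made-up placeholders, not our real names):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{col, input_file_name}
   
   // Hand-rolled diagnostic: scan the table and report which underlying
   // files contain nulls in a column the catalog declares as non-null.
   object FindNullFiles {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("find-null-files").getOrCreate()
   
       spark.read.parquet("hdfs:///data/events")   // HDFS location from the catalog
         .filter(col("userId").isNull)             // supposedly non-null field
         .select(input_file_name().as("offending_file"))
         .distinct()
         .show(truncate = false)
     }
   }
   ```
   
   If the error raised at read time named the column (and ideally the file), this kind of manual archaeology would be unnecessary.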
   
   But I suspect that treating _all_ DSv2 sources as untrusted doesn't make sense either. One option I was considering is to add a list of "trusted" (or "untrusted") entities, essentially an include/exclude list, denoting which DSv2 sources and/or UDFs are considered trusted or not.
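   
   For illustration only, such a list could hang off a SQL conf. A minimal sketch, assuming a hypothetical key `spark.sql.sources.trustedList` (nothing with this name exists in Spark today):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical include-list check: sources named in the conf would skip
   // runtime null validation; everything else gets the untrusted treatment.
   object TrustedSources {
     val ConfKey = "spark.sql.sources.trustedList"  // made-up conf key
   
     def isTrusted(spark: SparkSession, sourceName: String): Boolean =
       spark.conf.getOption(ConfKey)
         .map(_.split(",").map(_.trim.toLowerCase).toSet)
         .exists(_.contains(sourceName.toLowerCase))
   }
   ```
   
   An exclude list would simply invert the check; the open question is which default (trust everything vs. trust nothing) is the safer starting point.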


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org