Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/03 22:12:54 UTC

nchammas commented on issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's input json as literal only
URL: https://github.com/apache/spark/pull/22775#issuecomment-594197783
 
 
   This change seems like a step back from the original version introduced in #21686.
   
   I have a DataFrame with a JSON column. I suspect the JSON values have an inconsistent schema, so I want to first check whether a single schema can apply before trying to parse the column.
   
   With the original version of `schema_of_json()`, I could do something like this to check whether or not I have a consistent schema:
   
   ```python
   # Count the distinct inferred schemas; a count of 1 means a single schema fits every value.
   df.select(schema_of_json(...)).distinct().count()
   ```
   
   But now I can't do that. I can't even wrap `schema_of_json()` in a UDF to get something like that, because it returns a `Column`. It seems surprising from an API design point of view for a function to only accept literals but return Columns. And it seems inconsistent with the general tenor of Spark SQL functions for a function _not_ to accept Columns as input.
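   
   For concreteness, here's a minimal sketch of the behavior I mean (the column name and sample data are just for illustration):
   
   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import col, schema_of_json
   from pyspark.sql.utils import AnalysisException
   
   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame([('{"a": 1}',)], ["json"])
   
   # A string literal works; the schema is computed from the constant.
   df.select(schema_of_json('{"a": 1}')).show(truncate=False)
   
   # An actual column is rejected, because the argument must be foldable.
   try:
       df.select(schema_of_json(col("json"))).show()
   except AnalysisException as e:
       print(e)
   ```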
   
   Can we revisit the design of this function (as well as that of its cousin, `schema_of_csv()`)?
   
   Alternatively, would it make sense to deprecate these functions and instead recommend the approach that @HyukjinKwon suggested?
   
   > Actually, that use case can be more easily accomplished by simply inferring the schema with the JSON datasource. Yeah, I indeed suggested that as a workaround for this issue before. Let's say, `spark.read.json(df.select("json").as[String]).schema`.
   
   This demonstrates good Spark style (at least to me), and perhaps we can just promote this as a solution and do away with these functions.
   
   For the passing reader, the Python equivalent of Hyukjin's suggestion is:
   
   ```python
   # Assumes the JSON strings live in the first column of df.
   spark.read.json(df.rdd.map(lambda x: x[0])).schema
   ```
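   
   Putting that together, a runnable sketch (the toy data and the column name `json` are assumptions for illustration):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame(
       [('{"a": 1}',), ('{"a": 2, "b": "x"}',)],
       ["json"],
   )
   
   # Infer a single merged schema across every value in the column.
   inferred = spark.read.json(df.rdd.map(lambda x: x[0])).schema
   print(inferred.simpleString())  # struct<a:bigint,b:string>
   ```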
