Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/03 08:29:00 UTC

[jira] [Updated] (SPARK-27609) [Documentation Issue?] from_json expects values of options dictionary to be

     [ https://issues.apache.org/jira/browse/SPARK-27609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-27609:
---------------------------------
    Description: 
When reading a column of a DataFrame that consists of serialized JSON, one way to infer the schema and then parse the JSON is the following two-step process:
 
{code}
from pyspark.sql.functions import from_json

# this results in a new dataframe where the top-level keys of the JSON are columns
df_parsed_direct = spark.read.json(df.rdd.map(lambda row: row.json_col))

# this does that while preserving the rest of df
schema = df_parsed_direct.schema
df_parsed = df.withColumn('parsed', from_json(df.json_col, schema))
{code}


When I do this, I sometimes find myself passing in options. My understanding from the documentation [here|http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json] is that these options should be handled the same way whether I do

{code}
spark.read.option('option',value)
{code}

or

{code}
from_json(df.json_col, schema, options={'option':value})
{code}

 
However, I've found that the latter expects each option value to be passed as its string representation (something that can be decoded as JSON). So, for example, options=\{'multiLine': True} fails with

{code}
java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.String
{code}

whereas {{options={'multiLine':'true'}}} works just fine. 

Notably, providing {{spark.read.option('multiLine',True)}} works fine!
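
To make the difference concrete, here's a minimal, self-contained sketch (the sample data is made up, and the value-stringifying workaround at the end is just one possible way around this, not anything the docs suggest):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

# made-up sample data: one row whose json_col holds a multi-line JSON document
df = spark.createDataFrame([('{\n  "a": 1\n}',)], ['json_col'])
schema = StructType([StructField('a', LongType())])

# works: the reader accepts a Python boolean for the option value
spark.read.option('multiLine', True).json(df.rdd.map(lambda row: row.json_col)).show()

# works: from_json with the value already spelled as the string 'true'
df.withColumn('parsed', from_json(df.json_col, schema, options={'multiLine': 'true'})).show()

# fails: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.String
# df.withColumn('parsed', from_json(df.json_col, schema, options={'multiLine': True})).show()

# possible workaround: stringify the option values before handing them to from_json
options = {'multiLine': True}
string_options = {k: (str(v).lower() if isinstance(v, bool) else str(v))
                  for k, v in options.items()}
df.withColumn('parsed', from_json(df.json_col, schema, options=string_options)).show()
{code}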

The code for reproducing this issue and the stack trace from hitting it are provided in [this gist|https://gist.github.com/zmjjmz/0af5cf9b059b4969951e825565e266aa]. 

I also noticed that from_json doesn't complain if you give it a garbage option key – but that seems separate.
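
For completeness, a small sketch of that as well, continuing from the snippet above (the option key is made up):

{code}
# no error or warning is raised, even though 'notARealOption' is not a JSON option
df.withColumn('parsed', from_json(df.json_col, schema, options={'notARealOption': 'foo'})).show()
{code}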



> [Documentation Issue?] from_json expects values of options dictionary to be 
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-27609
>                 URL: https://issues.apache.org/jira/browse/SPARK-27609
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>         Environment: I've found this issue on an AWS Glue development endpoint which is running Spark 2.2.1 and being given jobs through a SparkMagic Python 2 kernel, running through livy and all that. I don't know how much of that is important for reproduction, and can get more details if needed. 
>            Reporter: Zachary Jablons
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org