Posted to issues@spark.apache.org by "Yuriy Davygora (JIRA)" <ji...@apache.org> on 2018/08/22 12:57:00 UTC

[jira] [Updated] (SPARK-25195) Extending from_json function

     [ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuriy Davygora updated SPARK-25195:
-----------------------------------
    Description: 
  Dear Spark and PySpark maintainers,

  I hope that opening a JIRA issue is the correct way to request an improvement. If it is not, please forgive me and kindly point me to the right channel instead.

  At our company, we are currently rewriting a lot of old MapReduce code in Spark, and the following use case comes up frequently: some string-valued DataFrame columns contain JSON arrays, and we want to parse them into array-typed columns.

  Problem number 1: The from_json function accepts as a schema only a StructType or an ArrayType(StructType), but not an ArrayType of primitives. Submitting the schema in string form, such as {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}, does not work either; the error message says, among other things, {noformat}data type mismatch: Input schema array<string> must be a struct or an array of structs.{noformat}

  Problem number 2: Sometimes the elements of our JSON arrays have different types. For example, we might have a JSON array like {noformat}["string_value", 0, true, null]{noformat} which is valid JSON under the schema {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat} (and, for instance, Python's json.loads function has no problem parsing it), but such a schema is not recognized at all. The error message becomes quite unreadable after the words {noformat}ParseException: u'\nmismatched input{noformat}
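
  To illustrate, plain Python parses such a mixed-type array without complaint:

  {noformat}
import json

# json.loads returns a heterogeneous Python list for the mixed-type array
parsed = json.loads('["string_value", 0, true, null]')
print(parsed)  # ['string_value', 0, True, None]
  {noformat}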

  Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and pandas 0.23.4):

  {noformat}
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, ArrayType

spark = SparkSession.builder.appName('test').getOrCreate()

data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", false, null]', '["string3", true, "another_string3"]']}
pdf = pd.DataFrame.from_dict(data)
df = spark.createDataFrame(pdf)
df.show()

df = df.withColumn("parsed_data", F.from_json(F.col('data'),
    ArrayType(StringType()))) # Fails: "Input schema array<string> must be a struct or an array of structs."

df = df.withColumn("parsed_data", F.from_json(F.col('data'),
    '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}')) # Does not work at all
  {noformat}
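
  For Problem 1, one conceivable workaround within the current API is to wrap the array in a single-field JSON object before parsing, so that from_json sees a struct schema. This is only a sketch: the field name "wrapped" is our own choice, and whether non-string elements get stringified for an array<string> target is an assumption about the parser's behavior, not something we found documented:

  {noformat}
from pyspark.sql.types import StructType, StructField

# Sketch: embed each JSON array in an object so from_json gets a StructType.
# The field name "wrapped" is arbitrary scaffolding, not part of any API.
wrapped_schema = StructType([StructField("wrapped", ArrayType(StringType()))])
df = df.withColumn(
    "parsed_data",
    F.from_json(
        F.concat(F.lit('{"wrapped":'), F.col('data'), F.lit('}')),
        wrapped_schema
    )["wrapped"]
)
  {noformat}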

  For now, we have to use a UDF that calls Python's json.loads, but this is, for obvious reasons, suboptimal. If you could extend the functionality of Spark's from_json function in the next release, that would be really helpful. Thank you in advance!
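
  A minimal sketch of that UDF workaround (the helper name parse_json_array is ours; elements are coerced to strings, since a Spark array column must be homogeneous):

  {noformat}
import json

# Sketch of the UDF workaround: parse with json.loads and stringify each
# element, because Spark array columns cannot hold mixed types.
def parse_json_array(s):
    if s is None:
        return None
    return [None if x is None else str(x) for x in json.loads(s)]

parse_json_array_udf = F.udf(parse_json_array, ArrayType(StringType()))
df = df.withColumn("parsed_data", parse_json_array_udf(F.col('data')))
  {noformat}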


> Extending from_json function
> ----------------------------
>
>                 Key: SPARK-25195
>                 URL: https://issues.apache.org/jira/browse/SPARK-25195
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Yuriy Davygora
>            Priority: Minor
>


