Posted to issues@spark.apache.org by "Huon Wilson (JIRA)" <ji...@apache.org> on 2019/03/04 04:46:00 UTC

[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

    [ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782958#comment-16782958 ] 

Huon Wilson commented on SPARK-26964:
-------------------------------------

I see. Could you say why you're resolving it as Later? I'm not quite sure how the error handling for corrupt records would differ between this and the existing functionality in {{from_json}}. For example, the corrupt-record handling for decoding {{"x"}} as {{int}} seems to exist already ({{JacksonParser.parse}} converts exceptions into {{BadRecordException}}s, and {{FailureSafeParser}} catches them), because the same error occurs when decoding {{\{"value":"x"\}}} as {{struct<value:int>}}.
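
To illustrate, a spark-shell sketch (as far as I can tell, the default PERMISSIVE mode turns the corrupt record into a null field rather than an error):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

// "x" is not a valid int: JacksonParser.parse throws a BadRecordException
// internally, and FailureSafeParser (in PERMISSIVE mode) converts it into
// a null field instead of failing the query.
val df = Seq("""{"value":"x"}""").toDF("json")
df.select(from_json($"json", new StructType().add("value", IntegerType))).show()
{code}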

Along those lines, we're now using the following code to map arbitrary values to their JSON strings, and back. It wraps the values in a struct so that {{to_json}} accepts them, then uses string manipulation to pull the true JSON string back out.

{code:scala}
import java.util.regex.Pattern

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}

object JsonHacks {
  // FIXME: massive hack working around (a) the requirement to make an
  // explicit map<string, binary> for storage (it would be nicer to just
  // dump columns in directly), and (b) to_json/from_json not supporting
  // scalars (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  def valueToJson(col: Column): Column = {
    // Nest the column in a struct so that to_json can work ...
    val structJson = to_json(struct(col as TempName))
    // ... but, because of this nesting, to_json(...) gives "{}" (not
    // null) if col is null, while this function needs to preserve that
    // null-ness.
    val nullOrStruct = when(col.isNull, null).otherwise(structJson)

    // Strip off the struct wrapping to pull out the JSON-ified `col`
    regexp_replace(nullOrStruct, StripRegexp, "")
  }

  def valueFromJson(
    col: Column,
    dataType: DataType,
    nullable: Boolean
  ): Column = {
    // from_json only works with a struct, so that's what we're going to be
    // parsing.
    val jsonSchema = new StructType().add(TempName, dataType, nullable)

    // To be able to parse into a struct, the JSON column needs to be wrapped
    // in what was stripped off above.
    val structJson = concat(lit(Prefix), col, lit(Suffix))
    // Now we're finally ready to parse
    val parsedStruct = from_json(structJson, jsonSchema)
    // ... and extract the field to get the actual parsed column.
    parsedStruct(TempName)
  }
}
{code}
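
For reference, a round-trip through these helpers looks something like this (spark-shell sketch; the column name and type are just illustrative):

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

val df = Seq(Some(1), None).toDF("n")

// gives "1" and null, rather than "{\"value\":1}" and "{}"
val asJson = df.select(JsonHacks.valueToJson(col("n")) as "json")
// back to an int column, with the null preserved
val back = asJson.select(
  JsonHacks.valueFromJson(col("json"), IntegerType, nullable = true) as "n")
{code}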

> to_json/from_json do not match JSON spec due to not supporting scalars
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26964
>                 URL: https://issues.apache.org/jira/browse/SPARK-26964
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Huon Wilson
>            Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, but not the scalar/primitive types. This doesn't match the JSON spec on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON document ({{json: element}}) consists of a value surrounded by whitespace ({{element: ws value ws}}), where a value is an object or array _or_ a number or string etc.:
> {code:none}
> value
>     object
>     array
>     string
>     number
>     "true"
>     "false"
>     "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible enough for a library I'm working on, where an arbitrary (user-supplied) column needs to be turned into JSON.
> NB. these newer specs differ from the original [RFC4627|https://tools.ietf.org/html/rfc4627] (now obsolete), which (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for arrays of scalars.


