Posted to issues@spark.apache.org by "Huon Wilson (JIRA)" <ji...@apache.org> on 2019/03/04 04:46:00 UTC
[jira] [Commented] (SPARK-26964) to_json/from_json do not match
JSON spec due to not supporting scalars
[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782958#comment-16782958 ]
Huon Wilson commented on SPARK-26964:
-------------------------------------
I see. Could you say why you're resolving it as Later? I'm not quite sure I understand how the error handling for corrupt records differs between this and the existing functionality in {{from_json}}, e.g. the corrupt record handling for decoding {{"x"}} as {{int}} seems to already exist (in the form of {{JacksonParser.parse}} converting exceptions into {{BadRecordException}}s, and {{FailureSafeParser}} catching them) because the same error occurs when decoding {{\{"value":"x"\}}} as {{struct<value:int>}}.
Along those lines, we're now using the following code to map arbitrary values to their JSON strings, and back. It involves wrapping the values in a struct, and using string manipulation to pull out the true JSON string.
{code:scala}
import java.util.regex.Pattern

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}
// ...
object JsonHacks {
  // FIXME: massive hack working around (a) the requirement to make an
  // explicit map<string, binary> for storage (would be nicer to just dump
  // columns in directly), and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  def valueToJson(col: Column): Column = {
    // Nest the column in a struct so that to_json can work ...
    val structJson = to_json(struct(col as TempName))
    // ... but, because of this nesting, to_json(...) gives "{}" (not
    // null) if col is null, while this function needs to preserve that
    // null-ness.
    val nullOrStruct = when(col.isNull, null).otherwise(structJson)
    // Strip off the struct wrapping to pull out the JSON-ified `col`
    regexp_replace(nullOrStruct, StripRegexp, "")
  }

  def valueFromJson(
    col: Column,
    dataType: DataType,
    nullable: Boolean
  ): Column = {
    // from_json only works with a struct, so that's what we're going to be
    // parsing.
    val jsonSchema = new StructType().add(TempName, dataType, nullable)
    // To be able to parse into a struct, the JSON column needs to be wrapped
    // in what was stripped off above.
    val structJson = concat(lit(Prefix), col, lit(Suffix))
    // Now we're finally ready to parse
    val parsedStruct = from_json(structJson, jsonSchema)
    // ... and extract the field to get the actual parsed column.
    parsedStruct(TempName)
  }
}
{code}
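To make the string manipulation concrete, here is a minimal Spark-free sketch of the same wrap/strip round trip on plain strings. The {{StripDemo}} object and its {{wrap}}/{{strip}} helpers are hypothetical names used only for illustration; the sketch assumes {{String.replaceAll}} treats the pattern the same way {{regexp_replace}} does (both use Java regular expressions).

```scala
import java.util.regex.Pattern

// Hypothetical, Spark-free illustration of the wrap/strip step used by
// JsonHacks above; it operates on plain strings instead of Columns.
object StripDemo {
  val TempName = "value"
  val Prefix = "{\"" + TempName + "\":"
  val Suffix = "}"
  // Same pattern as above: the prefix is removed only at the start of the
  // string, and the suffix only at the end.
  val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  // Mirrors concat(lit(Prefix), col, lit(Suffix)) in valueFromJson.
  def wrap(json: String): String = Prefix + json + Suffix

  // Mirrors regexp_replace(..., StripRegexp, "") in valueToJson.
  def strip(wrapped: String): String = wrapped.replaceAll(StripRegexp, "")
}
```

Because of the anchors, a payload that itself looks like the wrapper (for example {{\{"value":2\}}}) survives the round trip: only the outermost prefix and the final closing brace are removed.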
> to_json/from_json do not match JSON spec due to not supporting scalars
> ----------------------------------------------------------------------
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Huon Wilson
> Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, but not the scalar/primitive types. This doesn't match the JSON spec on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON document ({{json: element}}) consists of a value surrounded by whitespace ({{element: ws value ws}}), where a value is an object or array _or_ a number or string etc.:
> {code:none}
> value
>     object
>     array
>     string
>     number
>     "true"
>     "false"
>     "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible enough for a library I'm working on, where an arbitrary (user-supplied) column needs to be turned into JSON.
> NB. these newer specs differ from the original [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete), which (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for arrays of scalars.