Posted to issues@spark.apache.org by "Robert Joseph Evans (Jira)" <ji...@apache.org> on 2024/01/19 18:48:00 UTC

[jira] [Created] (SPARK-46778) get_json_object flattens wildcard queries that match a single value

Robert Joseph Evans created SPARK-46778:
-------------------------------------------

             Summary: get_json_object flattens wildcard queries that match a single value
                 Key: SPARK-46778
                 URL: https://issues.apache.org/jira/browse/SPARK-46778
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.1
            Reporter: Robert Joseph Evans


I think this impacts all versions of {{get_json_object}}, but I am not 100% sure.

The unit test for [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146] verifies that the output of this query is a single-level JSON array, but when I put the same JSON and JSON path into [http://jsonpath.com/] I get a result with multiple levels of nesting. It looks like Apache Spark flattens the result list for {{[*]}} matches when only a single element matches.
{code:java}
scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
+--------------------------------+
|get_json_object(jsonStr, $[*].a)|
+--------------------------------+
|"A"                             |
|["A","B"]                       |
+--------------------------------+
{code}
But this is a problem because the returned schema is no longer consistent, even when the input schema is known to be consistent. For example, if I wanted to know how many elements matched this query, I could wrap it in {{json_array_length}}, but that does not work in the general case.
{code:java}
scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
+---------------------------------------------------+
|json_array_length(get_json_object(jsonStr, $[*].a))|
+---------------------------------------------------+
|null                                               |
|2                                                  |
+---------------------------------------------------+
{code}
If the matched value is itself a JSON array, then I do get a number back, but it is wrong.
{code:java}
scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
+---------------------------------------------------+
|json_array_length(get_json_object(jsonStr, $[*].a))|
+---------------------------------------------------+
|5                                                  |
|2                                                  |
+---------------------------------------------------+
{code}
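To make the ambiguity concrete, here is a minimal sketch in plain Scala that models the behavior seen in the outputs above. It is not Spark's implementation: the object {{WildcardModel}} and its methods {{flattened}} and {{alwaysWrapped}} are hypothetical names, each input element stands for the raw JSON text of one matched value, and returning {{None}} for zero matches is an assumption standing in for SQL NULL.

```scala
// Hypothetical model of the behavior described in this issue; this is NOT Spark's
// implementation. Each element of `matches` is the raw JSON text of one value
// matched by the wildcard path.
object WildcardModel {
  // Observed behavior: a single match is returned bare, several matches are wrapped.
  // Returning None for zero matches is an assumption standing in for SQL NULL.
  def flattened(matches: Seq[String]): Option[String] = matches match {
    case Seq()       => None
    case Seq(single) => Some(single)
    case many        => Some(many.mkString("[", ",", "]"))
  }

  // Alternative: always wrap, so the result is a JSON array for any match count.
  def alwaysWrapped(matches: Seq[String]): String = matches.mkString("[", ",", "]")
}

val one = Seq("[1,2,3,4,5]")                 // one element matched, its value is an array
val two = Seq("[1,2,3,4,5]", "[1,2,3,4,5]")  // two elements matched

println(WildcardModel.flattened(one))      // Some([1,2,3,4,5])
println(WildcardModel.flattened(two))      // Some([[1,2,3,4,5],[1,2,3,4,5]])
println(WildcardModel.alwaysWrapped(one))  // [[1,2,3,4,5]]
```

In the flattened form, a caller cannot tell a single array-valued match from five scalar matches, which is why the length comes back as 5. In the always-wrapped variant, the outer array length equals the number of matches (1 and 2 here), which is the count the {{json_array_length}} query was trying to obtain.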


