You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "chenhao-db (via GitHub)" <gi...@apache.org> on 2024/03/12 20:26:52 UTC

[PR] [SPARK-47366][SQL] Implement parse_json. [spark]

chenhao-db opened a new pull request, #45479:
URL: https://github.com/apache/spark/pull/45479

   ### What changes were proposed in this pull request?
   
   This PR adds the implementation of the `parse_json` expression to replace the current fake implementation. This expression parses a JSON string as a variant value. It throws an exception when the string is not a valid JSON value or when the resulting variant cannot fit into the size limit.
   
   This feature is achieved by introducing a library`common/variant`. It includes the utility functions to build and manipulate binary-encoded variant values. It also contains a `README` that describes the variant binary format. It is intended that this library can be used outside of Spark.
   
   Some usage examples of `parse_json`:
   
   ```
   select parse_json('{"a": 1, "b": 2}');
   
   create table variant_table as select parse_json(j) as v from json_table;
   
   select parse_json('[');
   -- will throw an exception because the input is not valid JSON
   
   select parse_json('"' || repeat('a', 16 * 1024 * 1024) || '"');
   -- will throw an exception because the variant exceeds the size limit
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the `parse_json` expression will return an actual binary-encoded variant value rather than the original placeholder value.
   
   ### How was this patch tested?
   
   Unit tests that validate the `parse_json` result. Negative cases where the expression fail on invalid/large JSON are also covered.
   
   Some unit tests need to be temporarily disabled because the `toString` implementation doesn't match the `parse_json` implementation yet. I will shortly add a new `toString` implementation and re-enable them.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45479:
URL: https://github.com/apache/spark/pull/45479#issuecomment-1992583581

   @HyukjinKwon @cloud-fan do you want to help look? This approach seems fine to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45479:
URL: https://github.com/apache/spark/pull/45479#discussion_r1522344448


##########
project/SparkBuild.scala:
##########
@@ -58,10 +58,10 @@ object BuildCommons {
 
   val allProjects@Seq(
     core, graphx, mllib, mllibLocal, repl, networkCommon, networkShuffle, launcher, unsafe, tags, sketch, kvstore,
-    commonUtils, sqlApi, _*
+    commonUtils, sqlApi, variant, _*

Review Comment:
   Would also have to add at `dev/sparktestsupport/modules.py` and `.github/workflows/build_and_test.yml` so it runs the tests for individual modules.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45479:
URL: https://github.com/apache/spark/pull/45479#discussion_r1524526381


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##########
@@ -107,4 +107,17 @@ public String toString() {
     // NOTE: the encoding is not yet implemented, this is not the final implementation.
     return new String(value);
   }
+
+  /**
+   * Compare two variants in bytes. The variant equality is more complex than it, and we haven't
+   * supported it in the user surface yet. This method is only intended for tests.
+   */
+  @Override
+  public boolean equals(Object other) {

Review Comment:
   `equals` should be overriden with `hashCode` together.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45479:
URL: https://github.com/apache/spark/pull/45479#issuecomment-2008427418

   SGTM, @chenhao-db can you open a followup PR to address it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45479: [SPARK-47366][SQL] Implement parse_json.
URL: https://github.com/apache/spark/pull/45479


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45479:
URL: https://github.com/apache/spark/pull/45479#issuecomment-2016998132

   I made a PR, @cloud-fan and @chenhao-db .
   - https://github.com/apache/spark/pull/45685


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "chenhao-db (via GitHub)" <gi...@apache.org>.
chenhao-db commented on code in PR #45479:
URL: https://github.com/apache/spark/pull/45479#discussion_r1525473094


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##########
@@ -107,4 +107,17 @@ public String toString() {
     // NOTE: the encoding is not yet implemented, this is not the final implementation.
     return new String(value);
   }
+
+  /**
+   * Compare two variants in bytes. The variant equality is more complex than it, and we haven't
+   * supported it in the user surface yet. This method is only intended for tests.
+   */
+  @Override
+  public boolean equals(Object other) {

Review Comment:
   Done.



##########
common/variant/src/main/java/org/apache/spark/variant/Variant.java:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.variant;
+
+/**
+ * This class is structurally equivalent to {@link org.apache.spark.unsafe.types.VariantVal}. We
+ * define a new class to avoid depending on or modifying Spark.

Review Comment:
   I think it is better to have separate classes. It allows Spark to have full control of its internal in-memory format and the variant library can have more flexbility. I will later introduce a starting offset in `Variant`, but not in `VariantVal`.



##########
common/variant/README.md:
##########
@@ -0,0 +1,127 @@
+# Overview
+
+A Variant represents a type that contain one of:
+- Primitive: A type and corresponding value (e.g. INT, STRING)
+- Array: An ordered list of Variant values
+- Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys.
+
+A variant is encoded with 2 binary values, the value and the metadata.
+
+There are a fixed number of allowed primitive types, provided in the table below. These represent a commonly supported subset of the [logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) allowed by the Parquet.
+
+The Variant spec allows representation of semi-structured data (e.g. JSON) in a form that can be efficiently queried by path. The design is intended to allow efficient access to nested data even in the presence of very wide or deep structures.
+
+Another motivation for the representation is that (aside from metadata) each inner Variant value is contiguous and self-contained. For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
+
+# Metadata encoding
+
+The grammar for encoded metadata is as follows
+
+```
+metadata: <header> <dictionary_size> <dictionary>
+header: 1 byte (<version> | <sorted_strings> << 4 | (<offset_size_minus_one> << 6))
+version: a 4-bit version ID. Currently, must always contain the value 1

Review Comment:
   It is expected that the version will be updated very rarely. Even if we ever have the need of more than 16 versions, we can use a special version value to indicate there is another byte following it to further describe the version. This is a valid solution because different versions can have different binary formats.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45479:
URL: https://github.com/apache/spark/pull/45479#issuecomment-2008477595

   Thank you, @cloud-fan .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45479:
URL: https://github.com/apache/spark/pull/45479#discussion_r1524535881


##########
common/variant/README.md:
##########
@@ -0,0 +1,127 @@
+# Overview
+
+A Variant represents a type that contain one of:
+- Primitive: A type and corresponding value (e.g. INT, STRING)
+- Array: An ordered list of Variant values
+- Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys.
+
+A variant is encoded with 2 binary values, the value and the metadata.
+
+There are a fixed number of allowed primitive types, provided in the table below. These represent a commonly supported subset of the [logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) allowed by the Parquet.
+
+The Variant spec allows representation of semi-structured data (e.g. JSON) in a form that can be efficiently queried by path. The design is intended to allow efficient access to nested data even in the presence of very wide or deep structures.
+
+Another motivation for the representation is that (aside from metadata) each inner Variant value is contiguous and self-contained. For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
+
+# Metadata encoding
+
+The grammar for encoded metadata is as follows
+
+```
+metadata: <header> <dictionary_size> <dictionary>
+header: 1 byte (<version> | <sorted_strings> << 4 | (<offset_size_minus_one> << 6))
+version: a 4-bit version ID. Currently, must always contain the value 1

Review Comment:
   hmmm, at most 16 versions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45479:
URL: https://github.com/apache/spark/pull/45479#discussion_r1524542499


##########
common/variant/src/main/java/org/apache/spark/variant/Variant.java:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.variant;
+
+/**
+ * This class is structurally equivalent to {@link org.apache.spark.unsafe.types.VariantVal}. We
+ * define a new class to avoid depending on or modifying Spark.

Review Comment:
   shall we make Spark depend on this class and remove `VariantVal`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47366][SQL] Implement parse_json. [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45479:
URL: https://github.com/apache/spark/pull/45479#issuecomment-1998681496

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org