You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "sadikovi (via GitHub)" <gi...@apache.org> on 2023/09/04 04:03:55 UTC

[GitHub] [spark] sadikovi opened a new pull request, #42790: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

sadikovi opened a new pull request, #42790:
URL: https://github.com/apache/spark/pull/42790

   ### What changes were proposed in this pull request?
   
   Backport of https://github.com/apache/spark/pull/42667 to branch-3.5.
   
   The PR improves JSON parsing when `spark.sql.json.enablePartialResults` is enabled:
   - Fixes the issue when using nested arrays `ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow`
   - Improves parsing of the nested struct fields, e.g. `{"a1": "AAA", "a2": [{"f1": "", "f2": ""}], "a3": "id1", "a4": "XXX"}` used to be parsed as `|AAA|NULL    |NULL|NULL|` and now is parsed as `|AAA|[{NULL, }]|id1|XXX|`.
   - Improves performance of nested JSON parsing. The initial implementation would throw too many exceptions when multiple nested fields failed to parse. When the config is disabled, it is not a problem because the entire record is marked as NULL.
   
   The internal benchmarks show the performance improvement from slowdown of over 160% to an improvement of 7-8% compared to the master branch when the flag is enabled. I will create a follow-up ticket to add a benchmark for this regression.
   
   ### Why are the changes needed?
   
   Fixes some corner cases in JSON parsing and improves performance when `spark.sql.json.enablePartialResults` is enabled.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   I added tests to verify nested structs, maps, and arrays can be parsed without affecting the subsequent fields in the JSON. I also updated the existing tests when `spark.sql.json.enablePartialResults` is enabled because we parse more data now.
   
   I added a benchmark to check performance.
   
   Before the change (master, https://github.com/apache/spark/commit/a45a3a3d60cb97b107a177ad16bfe36372bc3e9b):
   ```
   [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   [info] Intel(R) Xeon(R) Platinum 8375C CPU  2.90GHz
   [info] Partial JSON results:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] parse invalid JSON                                 9537           9820         452          0.0      953651.6       1.0X
   ```
   
   After the change (this PR):
   ```
   OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
   Intel(R) Xeon(R) Platinum 8375C CPU  2.90GHz
   Partial JSON results:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   parse invalid JSON                                 3100           3106           6          0.0      309967.6       1.0X
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on pull request #42790: [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "sadikovi (via GitHub)" <gi...@apache.org>.

sadikovi commented on PR #42790:
URL: https://github.com/apache/spark/pull/42790#issuecomment-1704587030

   cc @cloud-fan @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun closed pull request #42790: [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun closed pull request #42790: [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
URL: https://github.com/apache/spark/pull/42790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org