Posted to commits@spark.apache.org by do...@apache.org on 2019/01/26 00:14:59 UTC
[spark] branch master updated: [SPARK-26711][SQL] Lazily convert string values to BigDecimal during JSON schema inference

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new f17a3d9  [SPARK-26711][SQL] Lazily convert string values to BigDecimal during JSON schema inference
f17a3d9 is described below

commit f17a3d9c3afed77156ab7468b80be1f32bbdb5c6
Author: Bruce Robbins <be...@gmail.com>
AuthorDate: Fri Jan 25 16:14:38 2019 -0800

    [SPARK-26711][SQL] Lazily convert string values to BigDecimal during JSON schema inference
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a bug where JSON schema inference attempts to convert every String value to a BigDecimal regardless of the setting of "prefersDecimal". With that bug, the inferred schema is still correct, but inference is slower than necessary.
    
    This PR makes this conversion lazy, so it is only performed if prefersDecimal is set to true.
    
    When Spark uses a single executor thread to infer the schema of a single-column, 100M-row JSON file, the performance impact is as follows:
    
    | option | baseline | pr |
    |-----|----|-----|
    | inferTimestamp=_default_, prefersDecimal=_default_ | 12.5 minutes | 6.1 minutes |
    | inferTimestamp=false, prefersDecimal=_default_ | 6.5 minutes | 49 seconds |
    | inferTimestamp=false, prefersDecimal=true | 6.5 minutes | 6.5 minutes |
    
    ## How was this patch tested?
    
    I ran JsonInferSchemaSuite and JsonSuite. Also, I ran manual tests to see performance impact (see above).
    
    Closes #23653 from bersprockets/SPARK-26711_improved.
    
    Authored-by: Bruce Robbins <be...@gmail.com>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
 .../main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
index 0bf3f03c..1fb4594 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
@@ -122,7 +122,7 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
 
       case VALUE_STRING =>
         val field = parser.getText
-        val decimalTry = allCatch opt {
+        lazy val decimalTry = allCatch opt {
           val bigDecimal = decimalParser(field)
             DecimalType(bigDecimal.precision, bigDecimal.scale)
         }

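The one-line change above relies on a Scala `lazy val` initializer running only on first access, so a short-circuiting `&&` can skip the parse entirely. The following is a minimal, self-contained sketch of that effect, not the actual Spark code. The names `LazyDecimalSketch`, `parseDecimal`, `parseCount`, and `inferType` are hypothetical, and a plain string stands in for Spark's `DecimalType`.

```scala
import scala.util.control.Exception.allCatch

object LazyDecimalSketch {
  // Counts how often the (comparatively expensive) decimal parse actually runs.
  var parseCount = 0

  // Hypothetical stand-in for the decimalParser used in JsonInferSchema.
  def parseDecimal(s: String): java.math.BigDecimal = {
    parseCount += 1
    new java.math.BigDecimal(s) // throws NumberFormatException on non-numeric input
  }

  // Returns a textual type name; the real code returns a Spark DataType instead.
  def inferType(field: String, prefersDecimal: Boolean): String = {
    // With `lazy val`, the block below is evaluated only when decimalTry is first read.
    lazy val decimalTry: Option[(Int, Int)] = allCatch.opt {
      val d = parseDecimal(field)
      (d.precision(), d.scale())
    }
    // && short-circuits: when prefersDecimal is false, decimalTry is never touched.
    if (prefersDecimal && decimalTry.isDefined) {
      val (precision, scale) = decimalTry.get
      s"decimal($precision,$scale)"
    } else {
      "string"
    }
  }
}
```

Under these assumptions, `LazyDecimalSketch.inferType("1.50", prefersDecimal = false)` returns `"string"` and leaves `parseCount` at 0, while the same call with `prefersDecimal = true` runs the parse once and returns `"decimal(3,2)"`. With a strict `val`, the parse would run on every call regardless of the flag.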
