Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/03 16:30:39 UTC

[GitHub] [spark] johnhany97 opened a new pull request #26749: [SPARK-30082][SQL] Do not replace Zeros when replacing NaNs

URL: https://github.com/apache/spark/pull/26749
 
 
   ### What changes were proposed in this pull request?
   Do not cast `NaN` to `Integer`, `Long`, `Short` or `Byte` when replacing values. Casting `NaN` to any of those types yields `0`, so a request to replace `NaN`s erroneously replaces `0`s in integral columns as well, when only `NaN`s should be replaced.
   
   
   ### Why are the changes needed?
   This Scala code snippet:
   ```
   // On the JVM, narrowing NaN to an integral type silently yields 0.
   println(Double.NaN.toLong)
   ```
   prints `0`, which is a problem: when you run the following PySpark code, the `0`s in the integer column get replaced as well:
   ```
   >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
   >>> df.show()
   +-----+-----+
   |index|value|
   +-----+-----+
   |  1.0|    0|
   |  0.0|    3|
   |  NaN|    0|
   +-----+-----+
   >>> df.replace(float('nan'), 2).show()
   +-----+-----+
   |index|value|
   +-----+-----+
   |  1.0|    2|
   |  0.0|    3|
   |  2.0|    2|
   +-----+-----+ 
   ```
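
The mechanism behind the output above can be modeled in plain Python, without Spark. This is only a sketch: `jvm_double_to_long` and `buggy_replace_long_column` are hypothetical helpers written for illustration, not Spark code, and the model assumes the replacement key is narrowed to the column's integral type before comparison, as the description above suggests.

```python
import math


def jvm_double_to_long(x: float) -> int:
    """Emulate the JVM's double-to-long narrowing (hypothetical helper)."""
    if math.isnan(x):
        return 0  # NaN narrows to 0 on the JVM
    if x >= 2**63:
        return 2**63 - 1  # values above the range clamp to Long.MaxValue
    if x <= -(2**63):
        return -(2**63)  # values below the range clamp to Long.MinValue
    return int(x)  # in-range values truncate toward zero


def buggy_replace_long_column(values, to_replace, replacement):
    """Model of the pre-fix behavior on an integral column (assumption)."""
    key = jvm_double_to_long(to_replace)  # the NaN key becomes 0 here
    new_value = jvm_double_to_long(replacement)
    return [new_value if v == key else v for v in values]


value_column = [0, 3, 0]
result = buggy_replace_long_column(value_column, float("nan"), 2.0)
print(result)
```

Under these assumptions the zeros in the `value` column are matched by the narrowed key, which mirrors the incorrect `2, 3, 2` column shown above.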
   
   ### Does this PR introduce any user-facing change?
   Yes. After this PR, running the same snippet returns the expected results, with only the `NaN` replaced:
   ```
   >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
   >>> df.show()
   +-----+-----+
   |index|value|
   +-----+-----+
   |  1.0|    0|
   |  0.0|    3|
   |  NaN|    0|
   +-----+-----+
   
   >>> df.replace(float('nan'), 2).show()
   +-----+-----+
   |index|value|
   +-----+-----+
   |  1.0|    0|
   |  0.0|    3|
   |  2.0|    0|
   +-----+-----+
   ```
   
   
   ### How was this patch tested?
   Added unit tests to verify that replacing `NaN` only affects columns of type `Float` and `Double`.
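
The property those tests check can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not the patch itself or its actual unit tests: `replace_in_column` and its `is_fractional` flag are hypothetical names, and the model simply skips integral columns whenever the value to replace is `NaN`.

```python
import math


def replace_in_column(values, is_fractional, to_replace, replacement):
    """Hypothetical model of the intended behavior for a single column."""
    if math.isnan(to_replace):
        if not is_fractional:
            # An integral column can never hold NaN, so leave it untouched.
            return list(values)
        return [replacement if math.isnan(v) else v for v in values]
    # Non-NaN keys are compared by ordinary equality.
    return [replacement if v == to_replace else v for v in values]


index_column = [1.0, 0.0, float("nan")]  # models a Double column
value_column = [0, 3, 0]                 # models a Long column

new_index = replace_in_column(index_column, True, float("nan"), 2.0)
new_value = replace_in_column(value_column, False, float("nan"), 2.0)
print(new_index, new_value)
```

In this model the fractional column has its `NaN` replaced while the integral column keeps its zeros, matching the post-fix output shown earlier.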
   
