Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2017/02/07 15:17:41 UTC

[jira] [Assigned] (SPARK-19488) CSV infer schema does not take into account Inf,-Inf,NaN

     [ https://issues.apache.org/jira/browse/SPARK-19488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19488:
------------------------------------

    Assignee: Apache Spark

> CSV infer schema does not take into account Inf,-Inf,NaN
> --------------------------------------------------------
>
>                 Key: SPARK-19488
>                 URL: https://issues.apache.org/jira/browse/SPARK-19488
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>         Environment: Windows 10, SparkShell
>            Reporter: Shivam Dalmia
>            Assignee: Apache Spark
>              Labels: easyfix, features
>
> I observed that while loading a CSV as a DataFrame, the user-specified values for nanValue, positiveInf and negativeInf are disregarded when inferSchema = true. (They do work if a user-specified schema is provided.) Moreover, even the Spark defaults for the infinities (Inf and -Inf) do not work with inferSchema.
> Taking a look at the schema-inference source for CSV (CSVInferSchema.scala), I found the following snippet:
> {code}
> 1.		private def tryParseDouble(field: String, options: CSVOptions): DataType = {
> 2.		    if ((allCatch opt field.toDouble).isDefined) {
> 3.		      DoubleType
> 4.		    } else {
> 5.		      tryParseTimestamp(field, options)
> 6.		    }
> 7.		  }
> 8.		
> 9.		  private def tryParseTimestamp(field: String, options: CSVOptions): DataType = {
> 10.		    // This case infers a custom `dataFormat` is set.
> 11.		    if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
> 12.		      TimestampType
> 13.		    } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
> 14.		      // We keep this for backwards compatibility.
> 15.		      TimestampType
> 16.		    } else {
> 17.		      tryParseBoolean(field, options)
> 18.		    }
> 19.		  }
> {code}
> Interestingly, the user-specified CSV options are not used at all when determining whether the field is a double (as we can see on line 2). The options object is consulted for the timestamp type (line 11), which is why the 'timestampFormat' option does work.
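A minimal sketch of how tryParseDouble could consult those options before falling back to toDouble. This is hypothetical, not the actual Spark patch: ParseOptions is a simplified stand-in for the relevant CSVOptions fields, and the method returns a type name as a String rather than a Spark DataType so the example stays self-contained.

```scala
import scala.util.control.Exception.allCatch

// Hypothetical, simplified stand-in for the relevant CSVOptions fields
// (the real CSVOptions class carries many more settings).
case class ParseOptions(nanValue: String = "NaN",
                        positiveInf: String = "Inf",
                        negativeInf: String = "-Inf")

// Sketch of a fixed tryParseDouble: check the user-configured tokens first,
// then fall back to the plain toDouble attempt.
def tryParseDouble(field: String, options: ParseOptions): String = {
  val matchesToken =
    field == options.nanValue ||
    field == options.positiveInf ||
    field == options.negativeInf
  if (matchesToken || (allCatch opt field.toDouble).isDefined) "DoubleType"
  else "StringType" // the real code would call tryParseTimestamp(field, options)
}
```

With this ordering, "Inf", "-Inf" and "NaN" (or whatever tokens the user configured) are recognized as doubles before the toDouble attempt, which is exactly the check the current code skips.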
> However, when the field is NaN, inference happens to work, because Scala's toDouble does convert the string "NaN" to the double equivalent of NaN (verified in the shell):
> {code}
> scala> import scala.util.control.Exception.allCatch
> import scala.util.control.Exception.allCatch
> scala> var field = "8.0942";
> field: String = 8.0942
> scala> allCatch.opt(field.toDouble)
> res12: Option[Double] = Some(8.0942)
> scala> field = "NaN";
> field: String = NaN
> scala> allCatch.opt(field.toDouble)
> res13: Option[Double] = Some(NaN)
> scala> field = "Inf";
> field: String = Inf
> scala> allCatch.opt(field.toDouble)
> res14: Option[Double] = None
> {code}
> Interestingly, Scala does parse Double equivalents of infinity when spelled Infinity and -Infinity (but the Spark defaults are Inf and -Inf, which is why they don't work):
> {code}
> scala> field = "Infinity";
> field: String = Infinity
> scala> allCatch.opt(field.toDouble)
> res15: Option[Double] = Some(Infinity)
> scala> field = "-Infinity";
> field: String = -Infinity
> scala> allCatch.opt(field.toDouble)
> res16: Option[Double] = Some(-Infinity)
> {code}
> The following CSV, when ingested with inferSchema = true, therefore has its value column inferred as Double, regardless of the user-specified options:
> {code}
> ID,name,value,irrational,prime,real
> 1,e,2.7,true,false,true
> 2,pi,3.14,true,false,true
> 3,inf,Infinity,false,false,true
> 4,-inf,-Infinity,false,false,true
> 5,i,NaN,false,false,false
> {code}
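Until schema inference honors these options, a pre-parse normalization step is one possible workaround. The helper below is hypothetical (not part of Spark): it maps the configured nanValue/positiveInf/negativeInf tokens to values that toDouble understands, and falls back to a plain parse attempt otherwise.

```scala
import scala.util.control.Exception.allCatch

// Hypothetical helper: parse a CSV field as Double, honoring custom
// NaN/Inf tokens the way the nanValue/positiveInf/negativeInf options
// are supposed to work. Returns None if the field is not a double.
def parseWithTokens(field: String,
                    nanValue: String = "NaN",
                    positiveInf: String = "Inf",
                    negativeInf: String = "-Inf"): Option[Double] = {
  field match {
    case `nanValue`    => Some(Double.NaN)
    case `positiveInf` => Some(Double.PositiveInfinity)
    case `negativeInf` => Some(Double.NegativeInfinity)
    case other         => allCatch opt other.toDouble
  }
}
```

Applied to the value column above, "Infinity", "-Infinity" and "NaN" all come back as doubles, and so would any custom token spelling passed in for the three parameters.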



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org