Posted to issues@spark.apache.org by "Shivam Dalmia (JIRA)" <ji...@apache.org> on 2017/02/07 08:01:41 UTC

[jira] [Created] (SPARK-19488) CSV infer schema does not take into account Inf,-Inf,NaN

Shivam Dalmia created SPARK-19488:
-------------------------------------

             Summary: CSV infer schema does not take into account Inf,-Inf,NaN
                 Key: SPARK-19488
                 URL: https://issues.apache.org/jira/browse/SPARK-19488
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
         Environment: Windows 10, SparkShell
            Reporter: Shivam Dalmia


I observed that while loading a CSV as a dataframe, the user-specified values for nanValue, positiveInf and negativeInf are disregarded when inferSchema = true. (They are honoured if a user-specified schema is provided.) Moreover, even the Spark defaults for the infinities (Inf and -Inf) do not work with inferSchema.

Taking a look at the source code for the inferSchema for CSV (CSVInferSchema.scala), I found the following code snippet.
{code}
1.   private def tryParseDouble(field: String, options: CSVOptions): DataType = {
2.     if ((allCatch opt field.toDouble).isDefined) {
3.       DoubleType
4.     } else {
5.       tryParseTimestamp(field, options)
6.     }
7.   }
8.
9.   private def tryParseTimestamp(field: String, options: CSVOptions): DataType = {
10.    // This case infers a custom `dataFormat` is set.
11.    if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
12.      TimestampType
13.    } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
14.      // We keep this for backwards compatibility.
15.      TimestampType
16.    } else {
17.      tryParseBoolean(field, options)
18.    }
19.  }
{code}
Interestingly, the user-specified CSV options are not used at all when determining whether the field is of type double (as we can see on line 2). The options are used for the timestamp type (line 11), which is why the timestampFormat option does work.
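A minimal sketch of what consulting the options before falling back to toDouble could look like. The `CsvOpts` case class below is a simplified stand-in for Spark's CSVOptions (its field names mirror the option names but are illustrative, not the actual internal API):

```scala
import scala.util.control.Exception.allCatch

// Simplified stand-in for Spark's CSVOptions; defaults mirror the
// documented Spark defaults ("NaN", "Inf", "-Inf").
case class CsvOpts(nanValue: String = "NaN",
                   positiveInf: String = "Inf",
                   negativeInf: String = "-Inf")

// Sketch: recognise the user-configured tokens as doubles before
// delegating to Scala's toDouble, instead of ignoring the options.
def parsesAsDouble(field: String, options: CsvOpts): Boolean =
  field == options.nanValue ||
  field == options.positiveInf ||
  field == options.negativeInf ||
  (allCatch opt field.toDouble).isDefined

assert(parsesAsDouble("Inf", CsvOpts()))    // honoured via positiveInf
assert(parsesAsDouble("-Inf", CsvOpts()))   // honoured via negativeInf
assert(parsesAsDouble("3.14", CsvOpts()))   // ordinary double still works
assert(!parsesAsDouble("hello", CsvOpts())) // non-numeric still rejected
```

With a check like this, schema inference would classify a field as DoubleType whenever it matches the configured NaN/infinity tokens, restoring consistency with the user-specified-schema code path.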
However, when the field is NaN, it works anyway, because Scala's toDouble does convert the string "NaN" to Double.NaN (I tried it in the shell):

{code}
scala> var field = "8.0942";
field: String = 8.0942

scala> allCatch.opt(field.toDouble)
res12: Option[Double] = Some(8.0942)

scala> field = "NaN";
field: String = NaN

scala> allCatch.opt(field.toDouble)
res13: Option[Double] = Some(NaN)

scala> field = "Inf";
field: String = Inf

scala> allCatch.opt(field.toDouble)
res14: Option[Double] = None
{code}
Interestingly, Scala does parse the strings "Infinity" and "-Infinity" as doubles, but the Spark defaults are "Inf" and "-Inf", which is why they don't work:

{code}
scala> field = "Infinity";
field: String = Infinity

scala> allCatch.opt(field.toDouble)
res15: Option[Double] = Some(Infinity)

scala> field = "-Infinity";
field: String = -Infinity

scala> allCatch.opt(field.toDouble)
res16: Option[Double] = Some(-Infinity)
{code}
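The behaviour in the two transcripts above follows from the fact that Scala's String.toDouble delegates to java.lang.Double.parseDouble, which recognises exactly "Infinity", "-Infinity" and "NaN" (not "Inf"/"-Inf"). A self-contained check:

```scala
import scala.util.control.Exception.allCatch

// String.toDouble delegates to java.lang.Double.parseDouble, which
// accepts "Infinity", "-Infinity" and "NaN" but rejects "Inf"/"-Inf".
val parsed: Map[String, Option[Double]] =
  Seq("Infinity", "-Infinity", "NaN", "Inf", "-Inf")
    .map(s => s -> (allCatch opt s.toDouble))
    .toMap

assert(parsed("Infinity").contains(Double.PositiveInfinity))
assert(parsed("-Infinity").contains(Double.NegativeInfinity))
assert(parsed("NaN").exists(_.isNaN))
assert(parsed("Inf").isEmpty)   // Spark's default token is rejected
assert(parsed("-Inf").isEmpty)  // likewise for negative infinity
```

So with inferSchema the only NaN/infinity spellings that survive are the JVM's own, never the Spark defaults or any user-configured token.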

The following CSV, when ingested with inferSchema = true, therefore has its value column inferred as a Double, regardless of the user-specified options:

{code}
ID,name,value,irrational,prime,real
1,e,2.7,true,false,true
2,pi,3.14,true,false,true
3,inf,Infinity,false,false,true
4,-inf,-Infinity,false,false,true
5,i,NaN,false,false,false

{code}
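To reproduce, one would read the file roughly as follows (a sketch only; it requires a live SparkSession, and the file path is illustrative). Even with the nanValue/positiveInf/negativeInf options set explicitly, inference still only honours the JVM spellings shown above:

```scala
// Sketch, assuming `spark` is an existing SparkSession and
// "numbers.csv" contains the sample data above.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nanValue", "NaN")
  .option("positiveInf", "Inf")   // ignored by schema inference
  .option("negativeInf", "-Inf")  // ignored by schema inference
  .csv("numbers.csv")

df.printSchema()  // value is inferred as double only because the file
                  // happens to use "Infinity"/"-Infinity"/"NaN"
```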





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
