Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2015/07/09 02:19:04 UTC

[jira] [Assigned] (SPARK-6573) Convert inbound NaN values as null

     [ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-6573:
---------------------------------

    Assignee: Davies Liu

> Convert inbound NaN values as null
> ----------------------------------
>
>                 Key: SPARK-6573
>                 URL: https://issues.apache.org/jira/browse/SPARK-6573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Fabian Boehnlein
>            Assignee: Davies Liu
>
> In pandas it is common to use numpy.nan as the null value for missing data:
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only accepts None as the null value, parsing it as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a pandas DataFrame whose object-typed (string) columns contain np.nan values (which are floats); a workaround sketch follows the trace:
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-38-34f0263f0bf4> in <module>()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340 
> --> 341         return self.applySchema(data, schema)
>     342 
>     343     def registerDataFrameAsTable(self, rdd, tableName):
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246 
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249 
>     250         # convert python objects to sql data
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067 
>    1068 _cached_cls = weakref.WeakValueDictionary()
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051 
>    1052     if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type <type 'float'>
> {code}
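> As a workaround until np.nan is handled natively, here is a minimal sketch: it assumes an existing SQLContext (sqlCtx, as in the trace above) and uses a hypothetical pandas DataFrame with one string column; the NaN cells are converted to None on the pandas side before calling createDataFrame.
> {code}
> import numpy as np
> import pandas as pd
> from pyspark.sql.types import StructType, StructField, StringType
>
> # hypothetical example data: an object (string) column containing np.nan
> df_ = pd.DataFrame({"name": ["alice", np.nan, "bob"]})
> schema = StructType([StructField("name", StringType(), True)])
>
> # cast to object so None is preserved, then replace NaN cells with None
> df_clean = df_.astype(object).where(df_.notnull(), None)
>
> # sqlCtx is assumed to be an existing SQLContext, as in the trace above
> sqldf = sqlCtx.createDataFrame(df_clean, schema=schema)
> {code}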



