You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/04/11 02:48:12 UTC
[jira] [Created] (SPARK-6857) Python SQL schema inference should
support numpy types
Joseph K. Bradley created SPARK-6857:
----------------------------------------
Summary: Python SQL schema inference should support numpy types
Key: SPARK-6857
URL: https://issues.apache.org/jira/browse/SPARK-6857
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did.
E.g.:
{code}
import numpy
from collections import namedtuple
from pyspark.sql import SQLContext
MyType = namedtuple('MyType', 'x')
myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(myValues)
{code}
The above code fails with:
{code}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
return self.inferSchema(data, samplingRatio)
File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
schema = self._inferSchema(rdd, samplingRatio)
File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
schema = _infer_schema(first)
File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
raise ValueError("not supported type: %s" % type(obj))
ValueError: not supported type: <type 'numpy.int64'>
{code}
But if we cast to int (not numpy types) first, it's OK:
{code}
myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
data = sqlContext.createDataFrame(myNativeValues) # OK
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org