Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/01/09 19:21:00 UTC
[jira] [Updated] (SPARK-23009) PySpark should not assume Pandas cols are a basestring type

     [ https://issues.apache.org/jira/browse/SPARK-23009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated SPARK-23009:
---------------------------------
    Description: 
When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as input, Spark assumes that every column label is either a {{str}} or a {{unicode}} instance.  Column labels can actually be any hashable object, that is, anything a dict can use as a key.  If a label is not a {{basestring}} type, a confusing AttributeError is thrown:

{{noformat}}
In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))

In [17]: pdf
Out[17]: 
          0         1
0  0.145171  0.482940
1  0.151336  0.299861
2  0.220338  0.830133
3  0.001659  0.513787

In [18]: pdf.columns
Out[18]: RangeIndex(start=0, stop=2, step=1)

In [19]: df = spark.createDataFrame(pdf)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-11bcb07e0e39> in <module>()
----> 1 df = spark.createDataFrame(pdf)

/home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    646             # If no schema supplied by user then get the names of columns only
    647             if schema is None:
--> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
    649 
    650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \

AttributeError: 'int' object has no attribute 'encode'
{{noformat}}
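
One way to avoid the failure is to convert every column label to text before using it as a schema name, rather than assuming it has an {{encode}} method. The following is only a minimal sketch of that idea, written for Python 3; {{normalize_column_labels}} is a hypothetical helper and not part of PySpark, and this is not necessarily the fix adopted in Spark.

```python
def normalize_column_labels(labels):
    """Return the given column labels as a list of text strings.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    names = []
    for label in labels:
        if isinstance(label, str):
            # Already text: keep unchanged.
            names.append(label)
        elif isinstance(label, bytes):
            # Byte strings are decoded, assuming UTF-8.
            names.append(label.decode("utf-8"))
        else:
            # Any other hashable label (int, float, tuple, ...) is stringified.
            names.append(str(label))
    return names


# range(2) stands in for the RangeIndex(start=0, stop=2, step=1) shown above.
print(normalize_column_labels(range(2)))          # ['0', '1']
print(normalize_column_labels(["a", b"b", 3.5]))  # ['a', 'b', '3.5']
```

Until a fix is available, a caller could probably work around the problem by renaming the labels first, for example with {{pdf.columns = pdf.columns.astype(str)}}, before passing the DataFrame to {{createDataFrame}}.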

  was:
When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as input, Spark assumes that every column label is either a {{str}} or a {{unicode}} instance.  Column labels can actually be any hashable object, that is, anything a dict can use as a key.  If a label is not a {{basestring}} type, a confusing AttributeError is thrown:

{{code}}
In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))

In [17]: pdf
Out[17]: 
          0         1
0  0.145171  0.482940
1  0.151336  0.299861
2  0.220338  0.830133
3  0.001659  0.513787

In [18]: pdf.columns
Out[18]: RangeIndex(start=0, stop=2, step=1)

In [19]: df = spark.createDataFrame(pdf)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-11bcb07e0e39> in <module>()
----> 1 df = spark.createDataFrame(pdf)

/home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    646             # If no schema supplied by user then get the names of columns only
    647             if schema is None:
--> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
    649 
    650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \

AttributeError: 'int' object has no attribute 'encode'
{{code}}


> PySpark should not assume Pandas cols are a basestring type
> -----------------------------------------------------------
>
>                 Key: SPARK-23009
>                 URL: https://issues.apache.org/jira/browse/SPARK-23009
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as input, Spark assumes that every column label is either a {{str}} or a {{unicode}} instance.  Column labels can actually be any hashable object, that is, anything a dict can use as a key.  If a label is not a {{basestring}} type, a confusing AttributeError is thrown:
> {{noformat}}
> In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))
> In [17]: pdf
> Out[17]: 
>           0         1
> 0  0.145171  0.482940
> 1  0.151336  0.299861
> 2  0.220338  0.830133
> 3  0.001659  0.513787
> In [18]: pdf.columns
> Out[18]: RangeIndex(start=0, stop=2, step=1)
> In [19]: df = spark.createDataFrame(pdf)
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-18-11bcb07e0e39> in <module>()
> ----> 1 df = spark.createDataFrame(pdf)
> /home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
>     646             # If no schema supplied by user then get the names of columns only
>     647             if schema is None:
> --> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
>     649 
>     650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
> AttributeError: 'int' object has no attribute 'encode'
> {{noformat}}


