Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/01/09 19:21:00 UTC
[jira] [Updated] (SPARK-23009) PySpark should not assume Pandas cols are a basestring type
[ https://issues.apache.org/jira/browse/SPARK-23009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler updated SPARK-23009:
---------------------------------
Description:
When calling {{SparkSession.createDataFrame}} with a Pandas DataFrame as input, Spark assumes that the column labels are either {{str}} or {{unicode}}. They can actually be any hashable type, that is, anything a dict can key off of. If a label is not a {{basestring}} type, a confusing AttributeError is thrown:
{{noformat}}
In [16]: pdf = pd.DataFrame(np.random.rand(4, 2))

In [17]: pdf
Out[17]:
          0         1
0  0.145171  0.482940
1  0.151336  0.299861
2  0.220338  0.830133
3  0.001659  0.513787

In [18]: pdf.columns
Out[18]: RangeIndex(start=0, stop=2, step=1)

In [19]: df = spark.createDataFrame(pdf)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-11bcb07e0e39> in <module>()
----> 1 df = spark.createDataFrame(pdf)

/home/bryan/git/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    646             # If no schema supplied by user then get the names of columns only
    647             if schema is None:
--> 648                 schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in data.columns]
    649
    650             if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \

AttributeError: 'int' object has no attribute 'encode'
{{noformat}}
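The traceback shows the cause: line 648 calls {{encode}} on every label that is not a {{str}}, which only makes sense for {{unicode}} labels. As a rough sketch of the kind of normalization that would avoid this, the helper below converts arbitrary labels to strings. The name {{column_names_as_str}} is hypothetical and the code is written for Python 3; it is an illustration, not necessarily the change that Spark adopted.

```python
def column_names_as_str(labels):
    """Return str column names for arbitrary hashable column labels.

    Hypothetical helper for illustration; not part of PySpark.
    """
    names = []
    for label in labels:
        if isinstance(label, str):
            # Already a text label: keep it unchanged.
            names.append(label)
        elif isinstance(label, bytes):
            # Byte-string label: decode rather than calling str(), which
            # would produce "b'...'" on Python 3.
            names.append(label.decode("utf-8"))
        else:
            # Any other hashable label (int, float, tuple, ...): use str().
            names.append(str(label))
    return names


# Integer labels, like the RangeIndex(start=0, stop=2, step=1) in the report.
print(column_names_as_str(range(2)))  # ['0', '1']

# Mixed labels.
print(column_names_as_str(["a", b"b", 3]))  # ['a', 'b', '3']
```

Until a fix is available, a possible workaround on the user side is to rename the columns before the call, for example {{pdf.columns = column_names_as_str(pdf.columns)}}, and then pass {{pdf}} to {{spark.createDataFrame}}.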
was: the same description text, with the console session wrapped in {{code}} instead of {{noformat}}.
> PySpark should not assume Pandas cols are a basestring type
> -----------------------------------------------------------
>
> Key: SPARK-23009
> URL: https://issues.apache.org/jira/browse/SPARK-23009
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Bryan Cutler