Posted to issues@spark.apache.org by "Darcy Shen (Jira)" <ji...@apache.org> on 2021/07/27 08:41:00 UTC
[jira] [Commented] (SPARK-35211) Bug when creating dataframe without schema and with Arrow disabled
[ https://issues.apache.org/jira/browse/SPARK-35211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387883#comment-17387883 ]
Darcy Shen commented on SPARK-35211:
------------------------------------
Updated
> Bug when creating dataframe without schema and with Arrow disabled
> ------------------------------------------------------------------
>
> Key: SPARK-35211
> URL: https://issues.apache.org/jira/browse/SPARK-35211
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.1.1
> Reporter: Darcy Shen
> Priority: Major
> Labels: correctness
>
> A small reproducible repo can be found here: https://github.com/darcy-shen/spark-36283
> h2. Case 1: Create a PySpark DataFrame from a pandas DataFrame with Arrow disabled and without a schema
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
>
> # ExamplePoint and ExamplePointUDT are the UDT test helpers from the linked repo
> spark = SparkSession.builder.getOrCreate()
> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
> df = spark.createDataFrame(pdf)
> df.show()
> {code}
> h3. Incorrect result
> {code}
> +----------+
> | point|
> +----------+
> |(0.0, 0.0)|
> |(0.0, 0.0)|
> +----------+
> {code}
> h2. Case 2: Create a PySpark DataFrame from a pandas DataFrame with Arrow disabled and a mismatched schema
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField
>
> # ExamplePoint and ExamplePointUDT are the UDT test helpers from the linked repo
> spark = SparkSession.builder.getOrCreate()
> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
> schema = StructType([StructField('point', ExamplePointUDT(), False)])
> df = spark.createDataFrame(pdf, schema)
> df.show()
> {code}
> h3. Error thrown as expected
> {code}
> Traceback (most recent call last):
>   File "bug2.py", line 54, in <module>
>     df = spark.createDataFrame(pdf, schema)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
>     return super(SparkSession, self).createDataFrame(
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py", line 300, in createDataFrame
>     return self._create_dataframe(data, schema, samplingRatio, verifySchema)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 700, in _create_dataframe
>     rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 509, in _createFromLocal
>     data = list(data)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 682, in prepare
>     verify_func(obj)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
>     verify_value(obj)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1390, in verify_struct
>     verifier(v)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
>     verify_value(obj)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1304, in verify_udf
>     verifier(dataType.toInternal(obj))
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
>     verify_value(obj)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1354, in verify_array
>     element_verifier(i)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
>     verify_value(obj)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1403, in verify_default
>     verify_acceptable_types(obj)
>   File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1291, in verify_acceptable_types
>     raise TypeError(new_msg("%s can not accept object %r in type %s"
> TypeError: element in array field point: DoubleType can not accept object 1 in type <class 'int'>
> {code}
> h2. Case 3: Create a PySpark DataFrame from a pandas DataFrame with Arrow disabled and a matching schema
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField
>
> # ExamplePoint and ExamplePointUDT are the UDT test helpers from the linked repo
> spark = SparkSession.builder.getOrCreate()
> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])})
> schema = StructType([StructField('point', ExamplePointUDT(), False)])
> df = spark.createDataFrame(pdf, schema)
> df.show()
> {code}
> h3. Correct result
> {code}
> +----------+
> | point|
> +----------+
> |(1.0, 1.0)|
> |(2.0, 2.0)|
> +----------+
> {code}
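For context on why Case 2 fails while Case 3 succeeds: with schema verification on the non-Arrow path, DoubleType accepts only Python float, so the ints produced by ExamplePoint(1, 1) are rejected. The following standalone sketch mimics that acceptance check; the names _ACCEPTABLE and check_double are illustrative, not PySpark API (the real logic lives in the type verifier in pyspark/sql/types.py).

```python
# Sketch of the strict acceptance check behind the Case 2 TypeError.
# DoubleType accepts only Python float; int is rejected rather than coerced.
_ACCEPTABLE = {"double": (float,)}

def check_double(obj, name="element in array field point"):
    """Mimic the DoubleType acceptance check from the verifier."""
    if not isinstance(obj, _ACCEPTABLE["double"]):
        raise TypeError("%s: DoubleType can not accept object %r in type %s"
                        % (name, obj, type(obj)))
    return obj

check_double(1.0)    # float passes, as in Case 3
try:
    check_double(1)  # int is rejected, matching the traceback above
except TypeError as e:
    print(e)
```

This is why Case 3's ExamplePoint(1.0, 1.0) shows the correct result: the values are already floats, so the verifier passes them through unchanged.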
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org