Posted to issues@spark.apache.org by "bo zhao (Jira)" <ji...@apache.org> on 2022/07/20 07:16:00 UTC
[jira] [Created] (SPARK-39822) Provides a good error during create Index with different dtype elements
bo zhao created SPARK-39822:
-------------------------------
Summary: Provides a good error during create Index with different dtype elements
Key: SPARK-39822
URL: https://issues.apache.org/jira/browse/SPARK-39822
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.2.2
Reporter: bo zhao
PANDAS
{code:java}
>>> import pandas as pd
>>> pd.Index([1,2,'3',4])
Index([1, 2, '3', 4], dtype='object')
{code}
PYSPARK
{code:java}
Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
Spark context Web UI available at http://172.25.179.45:4042
Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
>>> ps.Index([1,2,'3',4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
    ps.from_pandas(
  File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
    return DataFrame(pd.DataFrame(index=pobj)).index
  File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
    internal = InternalFrame.from_pandas(pdf)
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
    ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
    spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
  File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
    return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
  File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
{code}
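I understand that pandas-on-Spark needs all elements of an Index to have the same dtype, but the raw pyarrow exception above does not say so. We need a clear error message, or something similar, that tells the user what went wrong and how to avoid it.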
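As a rough illustration of the kind of message that would help, here is a minimal sketch of a user-side check that could run before calling ps.Index. The helper name check_single_type and the wording of the message are hypothetical; nothing like this exists in PySpark, and it is not a proposal for how the fix should be implemented. It is also stricter than Arrow, since it compares exact Python types and would flag a mix of int and float.

```python
from collections import Counter


def check_single_type(values):
    """Return values as a list, or raise TypeError naming the mixed types.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    values = list(values)
    # Count elements per exact Python type name, e.g. {'int': 3, 'str': 1}.
    counts = Counter(type(v).__name__ for v in values)
    if len(counts) > 1:
        summary = ", ".join(f"{name} x{n}" for name, n in sorted(counts.items()))
        raise TypeError(
            "Cannot build a pandas-on-Spark Index from elements of different "
            f"types ({summary}). Cast them to one type first, for example "
            "with [str(v) for v in values]."
        )
    return values


if __name__ == "__main__":
    try:
        check_single_type([1, 2, '3', 4])
    except TypeError as exc:
        print(exc)
```

With a check like this, the user sees which types were mixed and one way to avoid the failure, instead of the ArrowInvalid traceback.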