Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/07/20 07:25:00 UTC
[jira] [Assigned] (SPARK-39822) Provides a good error during create Index with different dtype elements
[ https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-39822:
------------------------------------
Assignee: Apache Spark
> Provides a good error during create Index with different dtype elements
> -----------------------------------------------------------------------
>
> Key: SPARK-39822
> URL: https://issues.apache.org/jira/browse/SPARK-39822
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.2
> Reporter: bo zhao
> Assignee: Apache Spark
> Priority: Minor
>
> PANDAS
>
> {code:java}
> >>> import pandas as pd
> >>> pd.Index([1,2,'3',4])
> Index([1, 2, '3', 4], dtype='object')
> >>>
> {code}
> PYSPARK
>
> {code:java}
> Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
> Spark context Web UI available at http://172.25.179.45:4042
> Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
> SparkSession available as 'spark'.
> >>> from pyspark import pandas as ps
> WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
> >>> ps.Index([1,2,'3',4])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
>     ps.from_pandas(
>   File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
>     return DataFrame(pd.DataFrame(index=pobj)).index
>   File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
>     internal = InternalFrame.from_pandas(pdf)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
>     ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
>     spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
>   File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
> {code}
> I understand that pandas-on-Spark needs all elements of an Index to have the same dtype, but we should raise a clear error message that tells the user what went wrong and how to avoid it.
>
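To illustrate the kind of message the report asks for, here is a minimal sketch in plain Python. It is not PySpark code and not the fix adopted for this issue: the helper name check_single_type and the wording of the message are hypothetical. The sketch is also stricter than a real check would need to be, since it rejects any mix of Python types, including int/float mixes that Arrow may be able to convert.

```python
from typing import Any, Iterable, List


def check_single_type(values: Iterable[Any]) -> List[Any]:
    """Hypothetical pre-check (not part of PySpark): reject mixed-type index data early."""
    items = list(values)
    # Collect the distinct Python type names, ignoring missing values.
    type_names = sorted({type(v).__name__ for v in items if v is not None})
    if len(type_names) > 1:
        raise TypeError(
            "Index elements must share one type, but found: "
            + ", ".join(type_names)
            + ". Cast the elements to a common type first, "
            "for example [str(v) for v in data]."
        )
    return items


data = [1, 2, "3", 4]

try:
    check_single_type(data)
except TypeError as exc:
    # The message names the offending types and suggests a way out.
    message = str(exc)
    print(message)

# Caller-side workaround: normalize to one type before building the index.
as_strings = [str(v) for v in data]
checked = check_single_type(as_strings)
```

With the data from the report, the check fails with a message that names both int and str, while the list cast to strings passes. Passing such a uniformly typed list to ps.Index should avoid the ArrowInvalid error shown above, because Arrow no longer has to infer a single type from mixed values.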
--
This message was sent by Atlassian Jira
(v8.20.10#820010)