Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/07/20 07:25:00 UTC
[jira] [Assigned] (SPARK-39822) Provides a good error during create Index with different dtype elements

     [ https://issues.apache.org/jira/browse/SPARK-39822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39822:
------------------------------------

    Assignee: Apache Spark

> Provides a good error during create Index with different dtype elements
> -----------------------------------------------------------------------
>
>                 Key: SPARK-39822
>                 URL: https://issues.apache.org/jira/browse/SPARK-39822
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.2
>            Reporter: bo zhao
>            Assignee: Apache Spark
>            Priority: Minor
>
> PANDAS
>  
> {code:java}
> >>> import pandas as pd
> >>> pd.Index([1,2,'3',4])
> Index([1, 2, '3', 4], dtype='object')
>  {code}
> PYSPARK
>  
>  
> {code:java}
> Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
> Spark context Web UI available at http://172.25.179.45:4042
> Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
> SparkSession available as 'spark'.
> >>> from pyspark import pandas as ps
> WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
> >>> ps.Index([1,2,'3',4])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
>     ps.from_pandas(
>   File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
>     return DataFrame(pd.DataFrame(index=pobj)).index
>   File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
>     internal = InternalFrame.from_pandas(pdf)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
>     ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
>     spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
>   File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
>   File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
>  {code}
> I understand that pyspark.pandas needs all elements to have the same dtype, but we need a clear error message, or some other hint, that tells the user how to avoid this failure.
>  
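
One way to picture the kind of error the description asks for is an early check that runs before the values reach pyarrow. The sketch below is only an illustration: `ensure_single_type` is a hypothetical helper, not part of PySpark and not the fix adopted for this issue, and its rule (any two distinct Python types are rejected, `None` is ignored) is a simplifying assumption that is stricter than what pandas itself allows.

```python
from typing import Any, Iterable, List


def ensure_single_type(values: Iterable[Any]) -> List[Any]:
    """Hypothetical helper (not a PySpark API): reject mixed-type input early."""
    items = list(values)
    # Collect the distinct Python type names, ignoring missing values (None).
    type_names = sorted({type(v).__name__ for v in items if v is not None})
    if len(type_names) > 1:
        # Name the offending types and suggest a way out, instead of
        # letting a low-level conversion error surface later.
        raise TypeError(
            "Cannot create an Index from elements of different types "
            f"({', '.join(type_names)}); cast them to a single type first, "
            "for example [str(v) for v in values]."
        )
    return items


if __name__ == "__main__":
    try:
        ensure_single_type([1, 2, "3", 4])
    except TypeError as exc:
        print(exc)
```

Until a clearer message exists, a user can likely sidestep the failure by casting the elements to one type before building the index, for example `ps.Index([str(v) for v in [1, 2, '3', 4]])`.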



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
