Posted to issues@spark.apache.org by "bo zhao (Jira)" <ji...@apache.org> on 2022/07/20 07:16:00 UTC

[jira] [Created] (SPARK-39822) Provides a good error during create Index with different dtype elements

bo zhao created SPARK-39822:
-------------------------------

             Summary: Provides a good error during create Index with different dtype elements
                 Key: SPARK-39822
                 URL: https://issues.apache.org/jira/browse/SPARK-39822
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.2
            Reporter: bo zhao


PANDAS

 
{code:java}
>>> import pandas as pd
>>> pd.Index([1,2,'3',4])
Index([1, 2, '3', 4], dtype='object')
 {code}
PYSPARK

 

 
{code:java}
Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
Spark context Web UI available at http://172.25.179.45:4042
Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
>>> ps.Index([1,2,'3',4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
    ps.from_pandas(
  File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
    return DataFrame(pd.DataFrame(index=pobj)).index
  File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
    internal = InternalFrame.from_pandas(pdf)
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
    ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
    spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
  File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
    return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
  File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
 {code}
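I understand that pandas-on-Spark (pyspark.pandas) needs all elements of an Index to have the same dtype, but the raw pyarrow.lib.ArrowInvalid traceback is not helpful. We should raise a clear error message that tells the user what went wrong and how to avoid it.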
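
As a rough sketch of the kind of check and message this could mean, the snippet below validates the input before it would be handed to ps.Index. The helper name check_same_type and the wording of the message are hypothetical and not part of pyspark.pandas; the check is also stricter than pandas itself (for example, it would flag a mix of int and float), so treat it as an illustration under those assumptions rather than the proposed fix.

```python
from collections import Counter
from typing import Any, Sequence


def check_same_type(values: Sequence[Any]) -> None:
    """Raise a descriptive TypeError if `values` mixes Python types.

    Hypothetical helper for illustration only; not a pyspark.pandas API.
    """
    # Count how many elements there are of each Python type.
    type_counts = Counter(type(v).__name__ for v in values)
    if len(type_counts) > 1:
        found = ", ".join(
            f"{name} x{count}" for name, count in sorted(type_counts.items())
        )
        raise TypeError(
            "Index elements must share one type, but found: " + found + ". "
            "Cast the data to a single type first, "
            "e.g. [str(v) for v in data]."
        )


data = [1, 2, "3", 4]

# The mixed-type input from this report is rejected with a readable message.
try:
    check_same_type(data)
    message = ""
except TypeError as exc:
    message = str(exc)
print(message)

# Possible workaround on the user side: cast everything to one type.
as_str = [str(v) for v in data]
check_same_type(as_str)  # no exception: all elements are str
# The homogeneous list could then be passed on, e.g. ps.Index(as_str)
```

Casting to str is only one option; casting to int with [int(v) for v in data] would work equally well for this particular input.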

 


