Posted to commits@spark.apache.org by sr...@apache.org on 2022/03/23 14:00:35 UTC
[spark] branch master updated: [SPARK-18621][PYTHON] Make sql type reprs eval-able
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c5ebdc6 [SPARK-18621][PYTHON] Make sql type reprs eval-able
c5ebdc6 is described below
commit c5ebdc6ded0479f557541a08b1312097a9c4244f
Author: flynn <cr...@gmail.com>
AuthorDate: Wed Mar 23 08:59:30 2022 -0500
[SPARK-18621][PYTHON] Make sql type reprs eval-able
### What changes were proposed in this pull request?
These changes update the `__repr__` methods of type classes in `pyspark.sql.types` to print string representations which are `eval`-able. In other words, any instance of a `DataType` will produce a repr which can be passed to `eval()` to create an identical instance.
Similar changes previously submitted: https://github.com/apache/spark/pull/25495
### Why are the changes needed?
This [bug](https://issues.apache.org/jira/browse/SPARK-18621) has been around for a while. The current implementation returns a string representation which is valid Scala rather than valid Python. These changes fix the reprs so that they are valid Python.
The [motivation](https://docs.python.org/3/library/functions.html#repr) is "to return a string that would yield an object with the same value when passed to eval()".
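As a general illustration of that contract (independent of PySpark; the `Point` class below is a hypothetical example, not part of this change), a class satisfies it when its `__repr__` output reconstructs an equal instance:

```python
# Minimal sketch of the eval-able __repr__ contract.
# `Point` is a made-up class used only to illustrate the pattern.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        # Emit valid Python source that reconstructs an equal instance
        return "Point(%r, %r)" % (self.x, self.y)

p = Point(1, 2)
assert eval(repr(p)) == p  # the round trip holds
```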
### Does this PR introduce _any_ user-facing change?
Example:
Current implementation:
```python
from pyspark.sql.types import *
struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType(List(StructField(f1,StringType,true)))
new_struct = eval(repr(struct))
# Traceback (most recent call last):
# File "<input>", line 1, in <module>
# File "<string>", line 1, in <module>
# NameError: name 'List' is not defined
struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField(f1,StringType,true)
new_struct_field = eval(repr(struct_field))
# Traceback (most recent call last):
# File "<input>", line 1, in <module>
# File "<string>", line 1, in <module>
# NameError: name 'f1' is not defined
```
With changes:
```python
from pyspark.sql.types import *
struct = StructType([StructField('f1', StringType(), True)])
repr(struct)
# StructType([StructField('f1', StringType(), True)])
new_struct = eval(repr(struct))
struct == new_struct
# True
struct_field = StructField('f1', StringType(), True)
repr(struct_field)
# StructField('f1', StringType(), True)
new_struct_field = eval(repr(struct_field))
struct_field == new_struct_field
# True
```
### How was this patch tested?
The changes include a test which asserts that an instance of each type is equal to the `eval` of its `repr`, as in the above example.
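The shape of such a round-trip test can be sketched outside PySpark as well; the `StringTyp` and `ArrayTyp` classes below are hypothetical stand-ins for the real `DataType` classes, whereas the actual test iterates over instances from `pyspark.sql.types`:

```python
# Hypothetical stand-in classes mimicking the DataType repr pattern.
class StringTyp:
    def __repr__(self):
        return "StringTyp()"

    def __eq__(self, other):
        return type(self) is type(other)


class ArrayTyp:
    def __init__(self, element_type, contains_null=True):
        self.element_type = element_type
        self.contains_null = contains_null

    def __repr__(self):
        # %r recurses into the element type's repr, keeping it eval-able
        return "ArrayTyp(%r, %r)" % (self.element_type, self.contains_null)

    def __eq__(self, other):
        return (type(self) is type(other)
                and self.element_type == other.element_type
                and self.contains_null == other.contains_null)


# Assert that each instance equals the eval of its repr
instances = [StringTyp(), ArrayTyp(StringTyp(), False)]
for instance in instances:
    assert eval(repr(instance)) == instance
```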
Closes #34320 from crflynn/sql-types-repr.
Lead-authored-by: flynn <cr...@gmail.com>
Co-authored-by: Flynn <cr...@users.noreply.github.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
.../source/migration_guide/pyspark_3.2_to_3.3.rst | 1 +
python/pyspark/ml/functions.py | 8 +--
python/pyspark/pandas/extensions.py | 8 +--
python/pyspark/pandas/internal.py | 78 ++++++++++++----------
python/pyspark/pandas/spark/utils.py | 23 ++++---
python/pyspark/pandas/tests/test_groupby.py | 2 +-
python/pyspark/pandas/tests/test_series.py | 2 +-
python/pyspark/pandas/typedef/typehints.py | 52 +++++++--------
python/pyspark/sql/dataframe.py | 3 +-
python/pyspark/sql/tests/test_dataframe.py | 2 +-
python/pyspark/sql/tests/test_types.py | 23 +++++++
python/pyspark/sql/types.py | 32 ++++-----
12 files changed, 135 insertions(+), 99 deletions(-)
diff --git a/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst b/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst
index f2701d4..d81008d 100644
--- a/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst
+++ b/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst
@@ -23,3 +23,4 @@ Upgrading from PySpark 3.2 to 3.3
* In Spark 3.3, the ``pyspark.pandas.sql`` method follows [the standard Python string formatter](https://docs.python.org/3/library/string.html#format-string-syntax). To restore the previous behavior, set ``PYSPARK_PANDAS_SQL_LEGACY`` environment variable to ``1``.
* In Spark 3.3, the ``drop`` method of pandas API on Spark DataFrame supports dropping rows by ``index``, and sets dropping by index instead of column by default.
* In Spark 3.3, PySpark upgrades Pandas version, the new minimum required version changes from 0.23.2 to 1.0.5.
+* In Spark 3.3, the ``repr`` return values of SQL DataTypes have been changed to yield an object with the same value when passed to ``eval``.
diff --git a/python/pyspark/ml/functions.py b/python/pyspark/ml/functions.py
index aecafb3..9725d4b 100644
--- a/python/pyspark/ml/functions.py
+++ b/python/pyspark/ml/functions.py
@@ -58,11 +58,11 @@ def vector_to_array(col: Column, dtype: str = "float64") -> Column:
[Row(vec=[1.0, 2.0, 3.0], oldVec=[10.0, 20.0, 30.0]),
Row(vec=[2.0, 0.0, 3.0], oldVec=[20.0, 0.0, 30.0])]
>>> df1.schema.fields
- [StructField(vec,ArrayType(DoubleType,false),false),
- StructField(oldVec,ArrayType(DoubleType,false),false)]
+ [StructField('vec', ArrayType(DoubleType(), False), False),
+ StructField('oldVec', ArrayType(DoubleType(), False), False)]
>>> df2.schema.fields
- [StructField(vec,ArrayType(FloatType,false),false),
- StructField(oldVec,ArrayType(FloatType,false),false)]
+ [StructField('vec', ArrayType(FloatType(), False), False),
+ StructField('oldVec', ArrayType(FloatType(), False), False)]
"""
sc = SparkContext._active_spark_context
assert sc is not None and sc._jvm is not None
diff --git a/python/pyspark/pandas/extensions.py b/python/pyspark/pandas/extensions.py
index 69f7425..eeb02f0 100644
--- a/python/pyspark/pandas/extensions.py
+++ b/python/pyspark/pandas/extensions.py
@@ -109,7 +109,7 @@ def _register_accessor(
...
Traceback (most recent call last):
...
- ValueError: Cannot call DatetimeMethods on type StringType
+ ValueError: Cannot call DatetimeMethods on type StringType()
Note: This function is not meant to be used directly - instead, use register_dataframe_accessor,
register_series_accessor, or register_index_accessor.
@@ -169,7 +169,7 @@ def register_dataframe_accessor(name: str) -> Callable[[Type[T]], Type[T]]:
...
Traceback (most recent call last):
...
- ValueError: Cannot call DatetimeMethods on type StringType
+ ValueError: Cannot call DatetimeMethods on type StringType()
Examples
--------
@@ -250,7 +250,7 @@ def register_series_accessor(name: str) -> Callable[[Type[T]], Type[T]]:
...
Traceback (most recent call last):
...
- ValueError: Cannot call DatetimeMethods on type StringType
+ ValueError: Cannot call DatetimeMethods on type StringType()
Examples
--------
@@ -322,7 +322,7 @@ def register_index_accessor(name: str) -> Callable[[Type[T]], Type[T]]:
...
Traceback (most recent call last):
...
- ValueError: Cannot call DatetimeMethods on type StringType
+ ValueError: Cannot call DatetimeMethods on type StringType()
Examples
--------
diff --git a/python/pyspark/pandas/internal.py b/python/pyspark/pandas/internal.py
index ffc86ba..f79f0ad 100644
--- a/python/pyspark/pandas/internal.py
+++ b/python/pyspark/pandas/internal.py
@@ -206,7 +206,7 @@ class InternalField:
)
def __repr__(self) -> str:
- return "InternalField(dtype={dtype},struct_field={struct_field})".format(
+ return "InternalField(dtype={dtype}, struct_field={struct_field})".format(
dtype=self.dtype, struct_field=self.struct_field
)
@@ -293,13 +293,13 @@ class InternalFrame:
>>> internal.index_names
[None]
>>> internal.data_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=int64,struct_field=StructField(A,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(B,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(C,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(D,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(E,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('A', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('B', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('C', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('D', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('E', LongType(), False))]
>>> internal.index_fields
- [InternalField(dtype=int64,struct_field=StructField(__index_level_0__,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('__index_level_0__', LongType(), False))]
>>> internal.to_internal_spark_frame.show() # doctest: +NORMALIZE_WHITESPACE
+-----------------+---+---+---+---+---+
|__index_level_0__| A| B| C| D| E|
@@ -355,13 +355,13 @@ class InternalFrame:
['A', 'B', 'C', 'D', 'E']
>>> internal.index_names
[('A',)]
- >>> internal.data_fields
- [InternalField(dtype=int64,struct_field=StructField(B,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(C,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(D,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(E,LongType,false))]
+ >>> internal.data_fields # doctest: +NORMALIZE_WHITESPACE
+ [InternalField(dtype=int64, struct_field=StructField('B', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('C', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('D', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('E', LongType(), False))]
>>> internal.index_fields
- [InternalField(dtype=int64,struct_field=StructField(A,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('A', LongType(), False))]
>>> internal.to_internal_spark_frame.show() # doctest: +NORMALIZE_WHITESPACE
+---+---+---+---+---+
| A| B| C| D| E|
@@ -419,13 +419,13 @@ class InternalFrame:
>>> internal.index_names
[None, ('A',)]
>>> internal.data_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=int64,struct_field=StructField(B,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(C,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(D,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(E,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('B', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('C', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('D', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('E', LongType(), False))]
>>> internal.index_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=int64,struct_field=StructField(__index_level_0__,LongType,false)),
- InternalField(dtype=int64,struct_field=StructField(A,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('__index_level_0__', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('A', LongType(), False))]
>>> internal.to_internal_spark_frame.show() # doctest: +NORMALIZE_WHITESPACE
+-----------------+---+---+---+---+---+
|__index_level_0__| A| B| C| D| E|
@@ -508,9 +508,9 @@ class InternalFrame:
>>> internal.index_names
[('A',)]
>>> internal.data_fields
- [InternalField(dtype=int64,struct_field=StructField(B,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('B', LongType(), False))]
>>> internal.index_fields
- [InternalField(dtype=int64,struct_field=StructField(A,LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('A', LongType(), False))]
>>> internal.to_internal_spark_frame.show() # doctest: +NORMALIZE_WHITESPACE
+---+---+
| A| B|
@@ -596,9 +596,12 @@ class InternalFrame:
[('row_index_a',), ('row_index_b',), ('a', 'x')]
>>> internal.index_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=object,struct_field=StructField(__index_level_0__,StringType,false)),
- InternalField(dtype=object,struct_field=StructField(__index_level_1__,StringType,false)),
- InternalField(dtype=int64,struct_field=StructField((a, x),LongType,false))]
+ [InternalField(dtype=object,
+ struct_field=StructField('__index_level_0__', StringType(), False)),
+ InternalField(dtype=object,
+ struct_field=StructField('__index_level_1__', StringType(), False)),
+ InternalField(dtype=int64,
+ struct_field=StructField('(a, x)', LongType(), False))]
>>> internal.column_labels
[('a', 'y'), ('b', 'z')]
@@ -607,8 +610,8 @@ class InternalFrame:
[Column<'(a, y)'>, Column<'(b, z)'>]
>>> internal.data_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=int64,struct_field=StructField((a, y),LongType,false)),
- InternalField(dtype=int64,struct_field=StructField((b, z),LongType,false))]
+ [InternalField(dtype=int64, struct_field=StructField('(a, y)', LongType(), False)),
+ InternalField(dtype=int64, struct_field=StructField('(b, z)', LongType(), False))]
>>> internal.column_label_names
[('column_labels_a',), ('column_labels_b',)]
@@ -1505,13 +1508,14 @@ class InternalFrame:
2 30 c 1
>>> index_columns
['__index_level_0__']
- >>> index_fields
- [InternalField(dtype=int64,struct_field=StructField(__index_level_0__,LongType,false))]
+ >>> index_fields # doctest: +NORMALIZE_WHITESPACE
+ [InternalField(dtype=int64, struct_field=StructField('__index_level_0__',
+ LongType(), False))]
>>> data_columns
['(x, a)', '(y, b)']
>>> data_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=object,struct_field=StructField((x, a),StringType,false)),
- InternalField(dtype=category,struct_field=StructField((y, b),ByteType,false))]
+ [InternalField(dtype=object, struct_field=StructField('(x, a)', StringType(), False)),
+ InternalField(dtype=category, struct_field=StructField('(y, b)', ByteType(), False))]
>>> import datetime
>>> pdf = pd.DataFrame({
@@ -1521,9 +1525,11 @@ class InternalFrame:
>>> _, _, _, _, data_fields = (
... InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=True)
... )
- >>> data_fields
- [InternalField(dtype=datetime64[ns],struct_field=StructField(dt,TimestampNTZType,false)),
- InternalField(dtype=object,struct_field=StructField(dt_obj,TimestampNTZType,false))]
+ >>> data_fields # doctest: +NORMALIZE_WHITESPACE
+ [InternalField(dtype=datetime64[ns],
+ struct_field=StructField('dt', TimestampNTZType(), False)),
+ InternalField(dtype=object,
+ struct_field=StructField('dt_obj', TimestampNTZType(), False))]
>>> pdf = pd.DataFrame({
... "td": [datetime.timedelta(0)], "td_obj": [datetime.timedelta(0)]
@@ -1533,8 +1539,10 @@ class InternalFrame:
... InternalFrame.prepare_pandas_frame(pdf)
... )
>>> data_fields # doctest: +NORMALIZE_WHITESPACE
- [InternalField(dtype=timedelta64[ns],struct_field=StructField(td,DayTimeIntervalType(0,3),false)),
- InternalField(dtype=object,struct_field=StructField(td_obj,DayTimeIntervalType(0,3),false))]
+ [InternalField(dtype=timedelta64[ns],
+ struct_field=StructField('td', DayTimeIntervalType(0, 3), False)),
+ InternalField(dtype=object,
+ struct_field=StructField('td_obj', DayTimeIntervalType(0, 3), False))]
"""
pdf = pdf.copy()
diff --git a/python/pyspark/pandas/spark/utils.py b/python/pyspark/pandas/spark/utils.py
index 0940a5e..9b8b5bb 100644
--- a/python/pyspark/pandas/spark/utils.py
+++ b/python/pyspark/pandas/spark/utils.py
@@ -52,7 +52,7 @@ def as_nullable_spark_type(dt: DataType) -> DataType:
>>> as_nullable_spark_type(StructType([
... StructField("A", IntegerType(), True),
... StructField("B", FloatType(), False)])) # doctest: +NORMALIZE_WHITESPACE
- StructType(List(StructField(A,IntegerType,true),StructField(B,FloatType,true)))
+ StructType([StructField('A', IntegerType(), True), StructField('B', FloatType(), True)])
>>> as_nullable_spark_type(StructType([
... StructField("A",
@@ -62,9 +62,12 @@ def as_nullable_spark_type(dt: DataType) -> DataType:
... ArrayType(IntegerType(), False), False), False),
... StructField('b', StringType(), True)])),
... StructField("B", FloatType(), False)])) # doctest: +NORMALIZE_WHITESPACE
- StructType(List(StructField(A,StructType(List(StructField(a,MapType(IntegerType,ArrayType\
-(IntegerType,true),true),true),StructField(b,StringType,true))),true),\
-StructField(B,FloatType,true)))
+ StructType([StructField('A',
+ StructType([StructField('a',
+ MapType(IntegerType(),
+ ArrayType(IntegerType(), True), True), True),
+ StructField('b', StringType(), True)]), True),
+ StructField('B', FloatType(), True)])
"""
if isinstance(dt, StructType):
new_fields = []
@@ -132,7 +135,8 @@ def force_decimal_precision_scale(
>>> force_decimal_precision_scale(StructType([
... StructField("A", DecimalType(10, 0), True),
... StructField("B", DecimalType(14, 7), False)])) # doctest: +NORMALIZE_WHITESPACE
- StructType(List(StructField(A,DecimalType(38,18),true),StructField(B,DecimalType(38,18),false)))
+ StructType([StructField('A', DecimalType(38,18), True),
+ StructField('B', DecimalType(38,18), False)])
>>> force_decimal_precision_scale(StructType([
... StructField("A",
@@ -143,9 +147,12 @@ def force_decimal_precision_scale(
... StructField('b', StringType(), True)])),
... StructField("B", DecimalType(30, 15), False)]),
... precision=30, scale=15) # doctest: +NORMALIZE_WHITESPACE
- StructType(List(StructField(A,StructType(List(StructField(a,MapType(DecimalType(30,15),\
-ArrayType(DecimalType(30,15),false),false),false),StructField(b,StringType,true))),true),\
-StructField(B,DecimalType(30,15),false)))
+ StructType([StructField('A',
+ StructType([StructField('a',
+ MapType(DecimalType(30,15),
+ ArrayType(DecimalType(30,15), False), False), False),
+ StructField('b', StringType(), True)]), True),
+ StructField('B', DecimalType(30,15), False)])
"""
if isinstance(dt, StructType):
new_fields = []
diff --git a/python/pyspark/pandas/tests/test_groupby.py b/python/pyspark/pandas/tests/test_groupby.py
index 661526b..ec17e0d 100644
--- a/python/pyspark/pandas/tests/test_groupby.py
+++ b/python/pyspark/pandas/tests/test_groupby.py
@@ -2227,7 +2227,7 @@ class GroupByTest(PandasOnSparkTestCase, TestUtils):
with self.assertRaisesRegex(
TypeError,
"Expected the return type of this function to be of Series type, "
- "but found type ScalarType\\[LongType\\]",
+ "but found type ScalarType\\[LongType\\(\\)\\]",
):
psdf.groupby("a").transform(udf)
diff --git a/python/pyspark/pandas/tests/test_series.py b/python/pyspark/pandas/tests/test_series.py
index 95be334..c4e5a9a 100644
--- a/python/pyspark/pandas/tests/test_series.py
+++ b/python/pyspark/pandas/tests/test_series.py
@@ -3006,7 +3006,7 @@ class SeriesTest(PandasOnSparkTestCase, SQLTestUtils):
with self.assertRaisesRegex(
ValueError,
r"Expected the return type of this function to be of scalar type, "
- r"but found type SeriesType\[LongType\]",
+ r"but found type SeriesType\[LongType\(\)\]",
):
psser.apply(udf)
diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 695ed31..8a32a14 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -319,17 +319,17 @@ def pandas_on_spark_type(tpe: Union[str, type, Dtype]) -> Tuple[Dtype, types.Dat
Examples
--------
>>> pandas_on_spark_type(int)
- (dtype('int64'), LongType)
+ (dtype('int64'), LongType())
>>> pandas_on_spark_type(str)
- (dtype('<U'), StringType)
+ (dtype('<U'), StringType())
>>> pandas_on_spark_type(datetime.date)
- (dtype('O'), DateType)
+ (dtype('O'), DateType())
>>> pandas_on_spark_type(datetime.datetime)
- (dtype('<M8[ns]'), TimestampType)
+ (dtype('<M8[ns]'), TimestampType())
>>> pandas_on_spark_type(datetime.timedelta)
- (dtype('<m8[ns]'), DayTimeIntervalType(0,3))
+ (dtype('<m8[ns]'), DayTimeIntervalType(0, 3))
>>> pandas_on_spark_type(List[bool])
- (dtype('O'), ArrayType(BooleanType,true))
+ (dtype('O'), ArrayType(BooleanType(), True))
"""
try:
dtype = pandas_dtype(tpe)
@@ -381,7 +381,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtype
dtype('int64')
>>> inferred.spark_type
- LongType
+ LongType()
>>> def func() -> ps.Series[int]:
... pass
@@ -389,7 +389,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtype
dtype('int64')
>>> inferred.spark_type
- LongType
+ LongType()
>>> def func() -> ps.DataFrame[np.float, str]:
... pass
@@ -397,7 +397,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('float64'), dtype('<U')]
>>> inferred.spark_type
- StructType(List(StructField(c0,DoubleType,true),StructField(c1,StringType,true)))
+ StructType([StructField('c0', DoubleType(), True), StructField('c1', StringType(), True)])
>>> def func() -> ps.DataFrame[np.float]:
... pass
@@ -405,7 +405,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('float64')]
>>> inferred.spark_type
- StructType(List(StructField(c0,DoubleType,true)))
+ StructType([StructField('c0', DoubleType(), True)])
>>> def func() -> 'int':
... pass
@@ -413,7 +413,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtype
dtype('int64')
>>> inferred.spark_type
- LongType
+ LongType()
>>> def func() -> 'ps.Series[int]':
... pass
@@ -421,7 +421,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtype
dtype('int64')
>>> inferred.spark_type
- LongType
+ LongType()
>>> def func() -> 'ps.DataFrame[np.float, str]':
... pass
@@ -429,7 +429,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('float64'), dtype('<U')]
>>> inferred.spark_type
- StructType(List(StructField(c0,DoubleType,true),StructField(c1,StringType,true)))
+ StructType([StructField('c0', DoubleType(), True), StructField('c1', StringType(), True)])
>>> def func() -> 'ps.DataFrame[np.float]':
... pass
@@ -437,7 +437,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('float64')]
>>> inferred.spark_type
- StructType(List(StructField(c0,DoubleType,true)))
+ StructType([StructField('c0', DoubleType(), True)])
>>> def func() -> ps.DataFrame['a': np.float, 'b': int]:
... pass
@@ -445,7 +445,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('float64'), dtype('int64')]
>>> inferred.spark_type
- StructType(List(StructField(a,DoubleType,true),StructField(b,LongType,true)))
+ StructType([StructField('a', DoubleType(), True), StructField('b', LongType(), True)])
>>> def func() -> "ps.DataFrame['a': np.float, 'b': int]":
... pass
@@ -453,7 +453,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('float64'), dtype('int64')]
>>> inferred.spark_type
- StructType(List(StructField(a,DoubleType,true),StructField(b,LongType,true)))
+ StructType([StructField('a', DoubleType(), True), StructField('b', LongType(), True)])
>>> pdf = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
>>> def func() -> ps.DataFrame[pdf.dtypes]:
@@ -462,7 +462,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('int64'), dtype('int64')]
>>> inferred.spark_type
- StructType(List(StructField(c0,LongType,true),StructField(c1,LongType,true)))
+ StructType([StructField('c0', LongType(), True), StructField('c1', LongType(), True)])
>>> pdf = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
>>> def func() -> ps.DataFrame[zip(pdf.columns, pdf.dtypes)]:
@@ -471,7 +471,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('int64'), dtype('int64')]
>>> inferred.spark_type
- StructType(List(StructField(a,LongType,true),StructField(b,LongType,true)))
+ StructType([StructField('a', LongType(), True), StructField('b', LongType(), True)])
>>> pdf = pd.DataFrame({("x", "a"): [1, 2, 3], ("y", "b"): [3, 4, 5]})
>>> def func() -> ps.DataFrame[zip(pdf.columns, pdf.dtypes)]:
@@ -480,7 +480,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('int64'), dtype('int64')]
>>> inferred.spark_type
- StructType(List(StructField((x, a),LongType,true),StructField((y, b),LongType,true)))
+ StructType([StructField('(x, a)', LongType(), True), StructField('(y, b)', LongType(), True)])
>>> pdf = pd.DataFrame({"a": [1, 2, 3], "b": pd.Categorical([3, 4, 5])})
>>> def func() -> ps.DataFrame[pdf.dtypes]:
@@ -489,7 +489,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
>>> inferred.spark_type
- StructType(List(StructField(c0,LongType,true),StructField(c1,LongType,true)))
+ StructType([StructField('c0', LongType(), True), StructField('c1', LongType(), True)])
>>> def func() -> ps.DataFrame[zip(pdf.columns, pdf.dtypes)]:
... pass
@@ -497,7 +497,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtypes
[dtype('int64'), CategoricalDtype(categories=[3, 4, 5], ordered=False)]
>>> inferred.spark_type
- StructType(List(StructField(a,LongType,true),StructField(b,LongType,true)))
+ StructType([StructField('a', LongType(), True), StructField('b', LongType(), True)])
>>> def func() -> ps.Series[pdf.b.dtype]:
... pass
@@ -505,7 +505,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.dtype
CategoricalDtype(categories=[3, 4, 5], ordered=False)
>>> inferred.spark_type
- LongType
+ LongType()
>>> def func() -> ps.DataFrame[int, [int, int]]:
... pass
@@ -515,7 +515,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.spark_type.simpleString()
'struct<__index_level_0__:bigint,c0:bigint,c1:bigint>'
>>> inferred.index_fields
- [InternalField(dtype=int64,struct_field=StructField(__index_level_0__,LongType,true))]
+ [InternalField(dtype=int64, struct_field=StructField('__index_level_0__', LongType(), True))]
>>> def func() -> ps.DataFrame[pdf.index.dtype, pdf.dtypes]:
... pass
@@ -525,7 +525,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.spark_type.simpleString()
'struct<__index_level_0__:bigint,c0:bigint,c1:bigint>'
>>> inferred.index_fields
- [InternalField(dtype=int64,struct_field=StructField(__index_level_0__,LongType,true))]
+ [InternalField(dtype=int64, struct_field=StructField('__index_level_0__', LongType(), True))]
>>> def func() -> ps.DataFrame[
... ("index", CategoricalDtype(categories=[3, 4, 5], ordered=False)),
@@ -537,7 +537,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.spark_type.simpleString()
'struct<index:bigint,id:bigint,A:bigint>'
>>> inferred.index_fields
- [InternalField(dtype=category,struct_field=StructField(index,LongType,true))]
+ [InternalField(dtype=category, struct_field=StructField('index', LongType(), True))]
>>> def func() -> ps.DataFrame[
... (pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]:
@@ -548,7 +548,7 @@ def infer_return_type(f: Callable) -> Union[SeriesType, DataFrameType, ScalarTyp
>>> inferred.spark_type.simpleString()
'struct<__index_level_0__:bigint,a:bigint,b:bigint>'
>>> inferred.index_fields
- [InternalField(dtype=int64,struct_field=StructField(__index_level_0__,LongType,true))]
+ [InternalField(dtype=int64, struct_field=StructField('__index_level_0__', LongType(), True))]
"""
# We should re-import to make sure the class 'SeriesType' is not treated as a class
# within this module locally. See Series.__class_getitem__ which imports this class
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index c5de9fb..ea30e5b 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -364,7 +364,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
Examples
--------
>>> df.schema
- StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
+ StructType([StructField('age', IntegerType(), True),
+ StructField('name', StringType(), True)])
"""
if self._schema is None:
try:
diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
index 5f5e88f..be5e1d9 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -627,7 +627,7 @@ class DataFrameTests(ReusedSQLTestCase):
# field types mismatch will cause exception at runtime.
self.assertRaisesRegex(
Exception,
- "FloatType can not accept",
+ "FloatType\\(\\) can not accept",
lambda: rdd.toDF("key: float, value: string").collect(),
)
diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py
index 9ae6c3a..d9ad234 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -949,6 +949,29 @@ class TypesTests(ReusedSQLTestCase):
a = array.array(t)
self.spark.createDataFrame([Row(myarray=a)]).collect()
+ def test_repr(self):
+ instances = [
+ NullType(),
+ StringType(),
+ BinaryType(),
+ BooleanType(),
+ DateType(),
+ TimestampType(),
+ DecimalType(),
+ DoubleType(),
+ FloatType(),
+ ByteType(),
+ IntegerType(),
+ LongType(),
+ ShortType(),
+ ArrayType(StringType()),
+ MapType(StringType(), IntegerType()),
+ StructField("f1", StringType(), True),
+ StructType([StructField("f1", StringType(), True)]),
+ ]
+ for instance in instances:
+ self.assertEqual(eval(repr(instance)), instance)
+
def test_daytime_interval_type_constructor(self):
# SPARK-37277: Test constructors in day time interval.
self.assertEqual(DayTimeIntervalType().simpleString(), "interval day to second")
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 41db22b..23e54eb 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -80,7 +80,7 @@ class DataType:
"""Base class for data types."""
def __repr__(self) -> str:
- return self.__class__.__name__
+ return self.__class__.__name__ + "()"
def __hash__(self) -> int:
return hash(str(self))
@@ -364,7 +364,7 @@ class DayTimeIntervalType(AtomicType):
jsonValue = _str_repr
def __repr__(self) -> str:
- return "%s(%d,%d)" % (type(self).__name__, self.startField, self.endField)
+ return "%s(%d, %d)" % (type(self).__name__, self.startField, self.endField)
def needConversion(self) -> bool:
return True
@@ -415,7 +415,7 @@ class ArrayType(DataType):
return "array<%s>" % self.elementType.simpleString()
def __repr__(self) -> str:
- return "ArrayType(%s,%s)" % (self.elementType, str(self.containsNull).lower())
+ return "ArrayType(%s, %s)" % (self.elementType, str(self.containsNull))
def jsonValue(self) -> Dict[str, Any]:
return {
@@ -485,11 +485,7 @@ class MapType(DataType):
return "map<%s,%s>" % (self.keyType.simpleString(), self.valueType.simpleString())
def __repr__(self) -> str:
- return "MapType(%s,%s,%s)" % (
- self.keyType,
- self.valueType,
- str(self.valueContainsNull).lower(),
- )
+ return "MapType(%s, %s, %s)" % (self.keyType, self.valueType, str(self.valueContainsNull))
def jsonValue(self) -> Dict[str, Any]:
return {
@@ -570,7 +566,7 @@ class StructField(DataType):
return "%s:%s" % (self.name, self.dataType.simpleString())
def __repr__(self) -> str:
- return "StructField(%s,%s,%s)" % (self.name, self.dataType, str(self.nullable).lower())
+ return "StructField('%s', %s, %s)" % (self.name, self.dataType, str(self.nullable))
def jsonValue(self) -> Dict[str, Any]:
return {
@@ -616,9 +612,9 @@ class StructType(DataType):
--------
>>> struct1 = StructType([StructField("f1", StringType(), True)])
>>> struct1["f1"]
- StructField(f1,StringType,true)
+ StructField('f1', StringType(), True)
>>> struct1[0]
- StructField(f1,StringType,true)
+ StructField('f1', StringType(), True)
>>> struct1 = StructType([StructField("f1", StringType(), True)])
>>> struct2 = StructType([StructField("f1", StringType(), True)])
@@ -753,7 +749,7 @@ class StructType(DataType):
return "struct<%s>" % (",".join(f.simpleString() for f in self))
def __repr__(self) -> str:
- return "StructType(List(%s))" % ",".join(str(field) for field in self)
+ return "StructType([%s])" % ", ".join(str(field) for field in self)
def jsonValue(self) -> Dict[str, Any]:
return {"type": self.typeName(), "fields": [f.jsonValue() for f in self]}
@@ -979,17 +975,17 @@ def _parse_datatype_string(s: str) -> DataType:
Examples
--------
>>> _parse_datatype_string("int ")
- IntegerType
+ IntegerType()
>>> _parse_datatype_string("INT ")
- IntegerType
+ IntegerType()
>>> _parse_datatype_string("a: byte, b: decimal( 16 , 8 ) ")
- StructType(List(StructField(a,ByteType,true),StructField(b,DecimalType(16,8),true)))
+ StructType([StructField('a', ByteType(), True), StructField('b', DecimalType(16,8), True)])
>>> _parse_datatype_string("a DOUBLE, b STRING")
- StructType(List(StructField(a,DoubleType,true),StructField(b,StringType,true)))
+ StructType([StructField('a', DoubleType(), True), StructField('b', StringType(), True)])
>>> _parse_datatype_string("a: array< short>")
- StructType(List(StructField(a,ArrayType(ShortType,true),true)))
+ StructType([StructField('a', ArrayType(ShortType(), True), True)])
>>> _parse_datatype_string(" map<string , string > ")
- MapType(StringType,StringType,true)
+ MapType(StringType(), StringType(), True)
>>> # Error cases
>>> _parse_datatype_string("blabla") # doctest: +IGNORE_EXCEPTION_DETAIL
---------------------------------------------------------------------