You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2024/03/25 13:31:30 UTC

[PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

itholic opened a new pull request, #45699:
URL: https://github.com/apache/spark/pull/45699

   ### What changes were proposed in this pull request?
   
   This PR proposes to inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
   
   This PR also introduces new config `INFER_PANDAS_DICT_AS_MAP` to reach the goal.
   
   ### Why are the changes needed?
   
   Currently the PyArrow infers the Pandas dictionary field as `StructType` instead of `MapType`:
   
   ```python
   >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 'second': 0.3}]})
   >>> pa.Schema.from_pandas(pdf)
   str_col: string
   dict_col: struct<first: double, second: double>
     child 0, first: double
     child 1, second: double
   ```
   
   The problem is that this behavior make Spark cannot handle the schema properly:
   
   ```python
   >>> sdf = spark.createDataFrame(pdf)
   >>> sdf.withColumn("test", F.col("dict_col")[F.col("str_col")]).show()
   [INVALID_EXTRACT_FIELD_TYPE] Field name should be a non-null string literal, but it's "str_col". SQLSTATE: 42000
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No default behavior changes, but `createDataFrame` inferring `dict` as a `MapType` from Pandas DataFrame when `INFER_PANDAS_DICT_AS_MAP` set to `True`:
   
   **Before**
   ```python
   >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 'second': 0.3}]})
   >>> sdf = spark.createDataFrame(pdf)
   >>> sdf.withColumn("test", F.col("dict_col")[F.col("str_col")]).show()
   [INVALID_EXTRACT_FIELD_TYPE] Field name should be a non-null string literal, but it's "str_col". SQLSTATE: 42000
   ```
   
   **After**
   ```python
   >>> spark.conf.set("spark.sql.execution.pandas.inferPandasDictAsMap", True)
   >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 'second': 0.3}]})
   >>> sdf = spark.createDataFrame(pdf)
   >>> sdf.withColumn("test", F.col("dict_col")[F.col("str_col")]).show(truncate=False)
   +-------+-----------------------------+----+
   |str_col|dict_col                     |test|
   +-------+-----------------------------+----+
   |second |{first -> 0.7, second -> 0.3}|0.3 |
   +-------+-----------------------------+----+
   ```
   
   
   ### How was this patch tested?
   
   Added UT.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538565281


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   This gets a type from the first field. Is this the same when users explicitly specify the schema? e.g., what happen if the dictionary contains different types of values?



##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type
+                            spark_type = MapType(StringType(), from_arrow_type(arrow_type))
+                        else:
+                            spark_type = from_arrow_type(field_type)

Review Comment:
   This gets a type from the first field. Is this the same when users explicitly specify the schema? e.g., what happen if the dictionary contains different types of values?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538484642


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4392,6 +4392,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val INFER_PANDAS_DICT_AS_MAP = buildConf("spark.sql.execution.pandas.inferPandasDictAsMap")

Review Comment:
   Can we make `spark.sql.pyspark.inferNestedDictAsStruct.enabled` as optional, and reuse that configuration?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538570645


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   if the dictionary contains different types of values, Arrow automatically tries converting the values in the same type based on first item, and raise exception if it is not possible:
   
   ```python
   >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{"1": 0.7, "2": "1"}]})
   >>> sdf = spark.createDataFrame(pdf)
   Traceback (most recent call last):
     File "/.../spark/python/pyspark/sql/pandas/serializers.py", line 302, in _create_array
       return pa.Array.from_pandas(
     File "pyarrow/array.pxi", line 1116, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 340, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 86, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to double
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #45699:
URL: https://github.com/apache/spark/pull/45699#issuecomment-2024197689

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538548393


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4392,6 +4392,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val INFER_PANDAS_DICT_AS_MAP = buildConf("spark.sql.execution.pandas.inferPandasDictAsMap")

Review Comment:
   If that's the case, we don't need to make this conf optional.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538553270


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4392,6 +4392,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val INFER_PANDAS_DICT_AS_MAP = buildConf("spark.sql.execution.pandas.inferPandasDictAsMap")

Review Comment:
   Yeah, let me change to `.createWithDefault(false)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538570645


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   if the dictionary contains different types of values, Arrow automatically tries converting the values in the same type and raise exception if it is not possible:
   
   ```python
   >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{"1": 0.7, "2": "1"}]})
   >>> sdf = spark.createDataFrame(pdf)
   Traceback (most recent call last):
     File "/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/sql/pandas/serializers.py", line 302, in _create_array
       return pa.Array.from_pandas(
     File "pyarrow/array.pxi", line 1116, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 340, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 86, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to double
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538529826


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,34 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                session = SparkSession.active()
+                infer_pandas_dict_as_map = (
+                    str(session.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    if data.empty:
+                        raise PySparkValueError(
+                            error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                            message_parameters={},
+                        )
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            pandas_dict_col = data[field.name]

Review Comment:
   Sounds good! Just updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538609882


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:

Review Comment:
   Updated!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538562775


##########
python/pyspark/sql/tests/connect/test_connect_creation.py:
##########
@@ -554,6 +554,31 @@ def test_create_dataframe_from_pandas_with_ns_timestamp(self):
                 self.spark.createDataFrame(pdf).collect(),
             )
 
+    def test_schema_inference_from_pandas_with_dict(self):
+        from pyspark.sql.connect import functions as CF
+        from pyspark.sql import functions as F
+
+        pdf = pd.DataFrame({"str_col": ["second"], "dict_col": [{"first": 0.7, "second": 0.3}]})
+
+        self.connect.conf.set("spark.sql.execution.pandas.inferPandasDictAsMap", True)

Review Comment:
   I tried testing `with self.sql_conf` but it seems not working for some reason. Maybe we should define a config for `connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala`??
   
   I fixed test to manually reset the conf for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538564512


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:

Review Comment:
   Oh, I think we should make it available for vanilla PySpark as well. Let me add some more UTs for vanilla PySpark to make sure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538486047


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,34 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                session = SparkSession.active()

Review Comment:
   use `self`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538570645


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   if the dictionary contains different types of values, Arrow automatically tries converting the values in the same type based on first item, and raise exception if it is not possible:
   
   ```python
   >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{"1": 0.7, "2": "1"}]})
   >>> sdf = spark.createDataFrame(pdf)
   Traceback (most recent call last):
     File "/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/sql/pandas/serializers.py", line 302, in _create_array
       return pa.Array.from_pandas(
     File "pyarrow/array.pxi", line 1116, in pyarrow.lib.Array.from_pandas
     File "pyarrow/array.pxi", line 340, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 86, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to double
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538788487


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -600,15 +608,39 @@ def _create_from_pandas_with_arrow(
         )
         import pyarrow as pa
 
+        infer_pandas_dict_as_map = (
+            str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower() == "true"

Review Comment:
   Sure, let me address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1539571833


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):

Review Comment:
   What does nested type cases mean you are concerned about? Maybe do you mean a `dict` within a `dict`, or a `list` within a `dict` for example? If so, the result depends entirely on Arrow's inferring logic, so it is difficult to give an exact answer for all cases. If you have a specific case that you're concerned about, let me add some testing to make sure it works.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538714280


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -600,15 +608,39 @@ def _create_from_pandas_with_arrow(
         )
         import pyarrow as pa
 
+        infer_pandas_dict_as_map = (
+            str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower() == "true"

Review Comment:
   I think you should add a test in non-connect test cases, and let Spark Connect reuse the test case if possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538781868


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -600,15 +608,39 @@ def _create_from_pandas_with_arrow(
         )
         import pyarrow as pa
 
+        infer_pandas_dict_as_map = (
+            str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower() == "true"

Review Comment:
   yeah, I think it's better to make the test reused.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538546646


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4392,6 +4392,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val INFER_PANDAS_DICT_AS_MAP = buildConf("spark.sql.execution.pandas.inferPandasDictAsMap")

Review Comment:
   ah okie.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538564875


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type
+                            spark_type = MapType(StringType(), from_arrow_type(arrow_type))
+                        else:
+                            spark_type = from_arrow_type(field_type)

Review Comment:
   This gets a type from the first field. Is this the same when users explicitly specify the schema? e.g., what happen if the dictionary contains different types of values?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538565726


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   Let's at least add a test so the result is the same for both when the type is specified, and when it's not specified.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538764195


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -600,15 +608,39 @@ def _create_from_pandas_with_arrow(
         )
         import pyarrow as pa
 
+        infer_pandas_dict_as_map = (
+            str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower() == "true"

Review Comment:
   Ah, I added tests into connect test suite to directly compare non-connect and connect behavior like:
   
   ```python
   self.assertEqual(
       sdf.withColumn("test", SF.col("dict_col")[SF.col("str_col")]).collect(),
       cdf.withColumn("test", CF.col("dict_col")[CF.col("str_col")]).collect(),
   )
   ```
   
   Would it be better to move those tests into non-connect and reuse it?



##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -600,15 +608,39 @@ def _create_from_pandas_with_arrow(
         )
         import pyarrow as pa
 
+        infer_pandas_dict_as_map = (
+            str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower() == "true"

Review Comment:
   Ah, I added tests into connect test suite because I wanted to directly compare non-connect and connect behavior like:
   
   ```python
   self.assertEqual(
       sdf.withColumn("test", SF.col("dict_col")[SF.col("str_col")]).collect(),
       cdf.withColumn("test", CF.col("dict_col")[CF.col("str_col")]).collect(),
   )
   ```
   
   Would it be better to move those tests into non-connect and reuse it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1552202465


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   The API `StructType.field` seems to be available since pyarrow `10.0`.
   https://arrow.apache.org/docs/10.0/python/generated/pyarrow.StructType.html
   
   We may want to bump up the minimum pyarrow version, or we should avoid this API?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #45699:
URL: https://github.com/apache/spark/pull/45699#issuecomment-2018019966

   cc @HyukjinKwon @zhengruifeng @ueshin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538485433


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,34 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                session = SparkSession.active()
+                infer_pandas_dict_as_map = (
+                    str(session.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    if data.empty:
+                        raise PySparkValueError(
+                            error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                            message_parameters={},
+                        )
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            pandas_dict_col = data[field.name]

Review Comment:
   Instead of looking up the data, can you get the PyArrow type from `pa.StructType.field`? See also https://arrow.apache.org/docs/python/generated/pyarrow.StructType.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538546871


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:

Review Comment:
   Did it work before for empty struct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538558146


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:

Review Comment:
   will this code path be reused in vanilla pyspark (arrow optimization enabled)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1540514018


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -600,15 +608,39 @@ def _create_from_pandas_with_arrow(
         )
         import pyarrow as pa
 
+        infer_pandas_dict_as_map = (
+            str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower() == "true"

Review Comment:
   Test moved!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #45699: [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation
URL: https://github.com/apache/spark/pull/45699


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538553689


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:

Review Comment:
   No, this is also failed before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538595771


##########
python/pyspark/errors/error_classes.py:
##########
@@ -772,6 +772,11 @@
       "No active Spark session found. Please create a new Spark session before running the code."
     ]
   },
+  "NO_MATCHING_SPARK_TYPE_FOR_PYTHON_TYPE": {

Review Comment:
   Yeah, should remove in next commit



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538758678


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +426,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    fields = []
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   Tests added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538534299


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4392,6 +4392,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val INFER_PANDAS_DICT_AS_MAP = buildConf("spark.sql.execution.pandas.inferPandasDictAsMap")

Review Comment:
   I'm not sure if we can reuse `spark.sql.pyspark.inferNestedDictAsStruct.enabled` for this case?
   
   Because IIRC `spark.sql.pyspark.inferNestedDictAsStruct.enabled` was defined for inferring "nested" `dict` as `StructType` whereas current case is inferring `dict` (non-nested) as `MapType`, so the purpose of two options sounds a bit different for me.
   
   For example, if someone wants to convert a nested `dict` to a `StructType` (should set an option `True`) but at the same time doesn't want to convert the `dict` to a `MapType` (should set an option `False` or `None`), shouldn't there be two separate options?
   
   Please correct me if I missed something!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #45699:
URL: https://github.com/apache/spark/pull/45699#issuecomment-2018024404

   Let me mark it as a draft until briefly get reviewed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1539583165


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):

Review Comment:
   At least it will provide the same result for connect and non-connect for sure since all subsequent actions are delegated to Arrow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1539583165


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):

Review Comment:
   At least it will provide the same result for connect and non-connect for sure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #45699:
URL: https://github.com/apache/spark/pull/45699#issuecomment-2022014625

   > Shall we document the conf at Configuration in python/docs/source/user_guide/sql/type_conversions.rst?
   
   @xinrong-meng Sure, just added into the documents.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1552658075


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   Let's bump up .. @itholic 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538548273


##########
python/pyspark/sql/tests/connect/test_connect_creation.py:
##########
@@ -554,6 +554,31 @@ def test_create_dataframe_from_pandas_with_ns_timestamp(self):
                 self.spark.createDataFrame(pdf).collect(),
             )
 
+    def test_schema_inference_from_pandas_with_dict(self):
+        from pyspark.sql.connect import functions as CF

Review Comment:
   Let's add a comment see https://spark.apache.org/contributing.html
   
   ```
   def test_case(self):
       # SPARK-12345: a short description of the test
       ...
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538548082


##########
python/pyspark/sql/tests/connect/test_connect_creation.py:
##########
@@ -554,6 +554,31 @@ def test_create_dataframe_from_pandas_with_ns_timestamp(self):
                 self.spark.createDataFrame(pdf).collect(),
             )
 
+    def test_schema_inference_from_pandas_with_dict(self):
+        from pyspark.sql.connect import functions as CF
+        from pyspark.sql import functions as F

Review Comment:
   I believe this is already imported on the top



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538573751


##########
python/pyspark/errors/error_classes.py:
##########
@@ -772,6 +772,11 @@
       "No active Spark session found. Please create a new Spark session before running the code."
     ]
   },
+  "NO_MATCHING_SPARK_TYPE_FOR_PYTHON_TYPE": {

Review Comment:
   Seems this is not used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538547344


##########
python/pyspark/sql/tests/connect/test_connect_creation.py:
##########
@@ -554,6 +554,31 @@ def test_create_dataframe_from_pandas_with_ns_timestamp(self):
                 self.spark.createDataFrame(pdf).collect(),
             )
 
+    def test_schema_inference_from_pandas_with_dict(self):
+        from pyspark.sql.connect import functions as CF
+        from pyspark.sql import functions as F
+
+        pdf = pd.DataFrame({"str_col": ["second"], "dict_col": [{"first": 0.7, "second": 0.3}]})
+
+        self.connect.conf.set("spark.sql.execution.pandas.inferPandasDictAsMap", True)

Review Comment:
   can we use `with self.sql_conf`? Otherwise, this will affect the other test cases as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1538832573


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):

Review Comment:
   How does this work for nested type cases?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on PR #45699:
URL: https://github.com/apache/spark/pull/45699#issuecomment-2021143476

   Shall we document the conf at Configuration in python/docs/source/user_guide/sql/type_conversions.rst?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1552958733


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   Thanks for catching out. Let me bump up the pyarrow version.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47543][CONNECT][PYTHON] Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45699:
URL: https://github.com/apache/spark/pull/45699#discussion_r1552202465


##########
python/pyspark/sql/connect/session.py:
##########
@@ -418,6 +425,28 @@ def createDataFrame(
             # If no schema supplied by user then get the names of columns only
             if schema is None:
                 _cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+                infer_pandas_dict_as_map = (
+                    str(self.conf.get("spark.sql.execution.pandas.inferPandasDictAsMap")).lower()
+                    == "true"
+                )
+                if infer_pandas_dict_as_map:
+                    struct = StructType()
+                    pa_schema = pa.Schema.from_pandas(data)
+                    spark_type: Union[MapType, DataType]
+                    for field in pa_schema:
+                        field_type = field.type
+                        if isinstance(field_type, pa.StructType):
+                            if len(field_type) == 0:
+                                raise PySparkValueError(
+                                    error_class="CANNOT_INFER_EMPTY_SCHEMA",
+                                    message_parameters={},
+                                )
+                            arrow_type = field_type.field(0).type

Review Comment:
   The API `StructType.field` seems to be available since pyarrow `10.0`.
   https://arrow.apache.org/docs/10.0/python/generated/pyarrow.StructType.html
   
   We may want to bump up the minimum pyarrow version?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org