You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/13 20:28:54 UTC

[GitHub] [spark] physinet opened a new pull request, #36545: [WIP][SPARK-39168] Use all values in a python list when inferring ArrayType schema

physinet opened a new pull request, #36545:
URL: https://github.com/apache/spark/pull/36545

   ### What changes were proposed in this pull request?
   This PR modifies type inference for python lists to consider all values in the list, not just the first value.
   
   
   ### Why are the changes needed?
   This enables convenient type inference in the two following cases:
   |  | previous | current |
   | --- | --- | --- |
   | `[None, 1]` | `array<void>` (raises `ValueError`) | `array<bigint>` |
   | `[{"b": 1}, {"c": 2}]` | `array<struct<b:bigint>>`  | `array<struct<b:bigint,c:bigint>>`
   
   ### Does this PR introduce _any_ user-facing change?
   Possible user-facing changes:
   * attempting to infer the schema of an array with mixed types (e.g. `["a", 1]`) may result in a TypeError
      * previously, this was inferred as an `array<string>` and produced a value `["a", "1"]`
   * fields of inferred struct types will differ if dictionaries in an array have different keys
   
   ### How was this patch tested?
   Added unit tests for various cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874838643


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   Do we really need a flag? in the cases where this matters, the schema would be wrong and produce an error; would that be desirable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1128377781

   cc @BryanCutler @viirya @ueshin FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875147010


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   I am not super familiar with this part of the code and I don't object to the change if someone else wants to merge; just seems like it would be relatively straightforward to 'really' fix it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1126657160

   Nice PR description. Yeah, we should probably add a configuration then, please also refer to https://github.com/apache/spark/commit/2537fe8cbaf49070137d4b5bc39af078b306c4c8 for adding a SQL config


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875358379


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   I see, that is sounding reasonable then to me. I dislike flags, but, if a follow-up made the flag irrelevant, maybe that's OK. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema
URL: https://github.com/apache/spark/pull/36545


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on pull request #36545: [WIP][SPARK-39168] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1126456712

   - [ ] Add a unit test for ["a", 1] producing a `TypeError`
   - [ ] Consider making this behavior configurable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874781396


##########
python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst:
##########
@@ -0,0 +1,23 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+=================================
+Upgrading from PySpark 3.3 to 3.4
+=================================
+
+* In Spark 3.4, the schema of an array column is inferred by merging the schemas of all elements in the array. To restore the previous behavior where the schema is only inferred from the first element, set the spark configuration as follows: ``{"spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled": True}``.

Review Comment:
   [7cde786](https://github.com/apache/spark/commit/7cde786d3cf40e6cfbfbd42cff513d94b2003c06)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3758,6 +3758,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_INFER_ARRAY_TYPE_FROM_FIRST_ELEMENT =
+    buildConf("spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled")
+      .doc("PySpark's SparkSession.createDataFrame infers the element type of an array from all " +
+        "values in the array by default. If this config is set to true, it restores the legacy " +
+        "behavior of only inferring the type from the first array element.")
+      .version("3.3.0")

Review Comment:
   [969557f](https://github.com/apache/spark/pull/36545/commits/969557fde1157239e8c5ac912e8ca7b0606f0d6a)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874336480


##########
python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst:
##########
@@ -0,0 +1,23 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+=================================
+Upgrading from PySpark 3.3 to 3.4
+=================================
+
+* In Spark 3.4, the schema of an array column is inferred by merging the schemas of all elements in the array. To restore the previous behavior where the schema is only inferred from the first element, set the spark configuration as follows: ``{"spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled": True}``.

Review Comment:
   I think we can just say set `spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled` to `true` (just to be consistent with other places like https://github.com/apache/spark/blob/a6b207e00ac8add3d6b98a7f4a9f5e3406c4704a/python/docs/source/migration_guide/pyspark_3.1_to_3.2.rst#upgrading-from-pyspark-31-to-32



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1128297953

   > We should probably add a configuration like `spark.sql.pyspark.legacy.inferFirstElementInArray.enabled` (feel free to pick other names if you have other ideas). Would also have to create a new page like https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.2_to_3.3.rst and describe this behavior change.
   
   Added this configuration (slightly changed the name) and created a new migration guide page.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874275723


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -285,6 +285,64 @@ def test_infer_nested_dict_as_struct(self):
             df = self.spark.createDataFrame(data)
             self.assertEqual(Row(f1=[Row(payment=200.5, name="A")], f2=[1, 2]), df.first())
 
+    def test_infer_array_merge_element_types(self):

Review Comment:
   Added these comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1130899034

   Thanks for working on this @physinet, and congrats for being a Apache Spark contributor :-).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1130896943

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1126657800

   We should probably add a configuration like `spark.sql.pyspark.legacy.inferFirstElementInArray.enabled` (feel free to pick other names if you have other ideas).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r872942079


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -285,6 +285,64 @@ def test_infer_nested_dict_as_struct(self):
             df = self.spark.createDataFrame(data)
             self.assertEqual(Row(f1=[Row(payment=200.5, name="A")], f2=[1, 2]), df.first())
 
+    def test_infer_array_merge_element_types(self):

Review Comment:
   Let's add a comment like # SPARK-39168: ... (see also https://spark.apache.org/contributing.html)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875142772


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   This seems to be marked as a TODO (at least on the python side). https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1377
   
   Again, I think this should be addressed separately from this PR, but I'm happy to hold off on this PR if you think type-widening logic should be implemented first.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875864695


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   By the way, I think `StringType` is the only type that supports coercion: https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1596-L1599



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on pull request #36545: [WIP][SPARK-39168] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1126454650

   - [ Add a unit test for ["a", 1] producing a `TypeError` ]
   - [ Consider making this behavior configurable ] 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875033677


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   Wait, this seems like the wrong behavior. StringType should be inferred for this column, as it is now. I don't see a reason to break this? don't we want to just find the widest applicable type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875352636


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   Yeah, that's right that it causes a behaviour change. However, the (previous) string type coercion behaviour in an element of an array is actually a mistake I believe. For example, such type coercion is not supported in regular type inference:
   
   ```python
   >>> spark.createDataFrame([{"a": "1"}, {"a" :2}])
   Traceback (most recent call last):
     ...
   TypeError: field a: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.LongType'>
   ```
   
   So, what this PR actually does is to match the behaviour with non-nested type inference. The switch was added for users dependent on the previous behaviour.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875099589


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   I can see the argument for casting to the widest applicable type, but I think that should be a separate discussion. Inferring the type of an array I think should be analogous to inferring a type over rows, which raises an error in this case:
   ```python
   >>> spark.createDataFrame([{"a": "1"}, {"a": 2}])
   ...
   TypeError: field a: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.LongType'>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874983582


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   Previously it was allowed to have mixed types in a python list, as long as the types could be cast to the type enforced by the schema inferred from the first element:
   ```python
   >>> df = spark.createDataFrame([{"a": ["1", 2]}])
   >>> df.show()
   +------+
   |     a|
   +------+
   |[1, 2]|
   +------+
   >>> df.schema
   StructType(List(StructField(a,ArrayType(StringType,true),true)))
   ```
   With this change, creating the DataFrame causes an error:
   ```python
   >>> df = spark.createDataFrame([{"a": ["1", 2]}])
   ...
   TypeError: Unable to infer the type of the field a.
   ```
   Because of this change, I think it makes sense to have the behavior configurable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875104045


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   I see it fixes some cases, by somewhat 'accidentally' correctly inferring a widening type. But it does cause some common cases to start failing, when they 'accidentally' work now (your example in this thread). It feels like the half-measure isn't worth it, as it needs a whole new flag. Is it hard to just implement logic to find the closest type for everything? that code surely already exists in the code base



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r874336689


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3758,6 +3758,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val LEGACY_INFER_ARRAY_TYPE_FROM_FIRST_ELEMENT =
+    buildConf("spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled")
+      .doc("PySpark's SparkSession.createDataFrame infers the element type of an array from all " +
+        "values in the array by default. If this config is set to true, it restores the legacy " +
+        "behavior of only inferring the type from the first array element.")
+      .version("3.3.0")

Review Comment:
   ```suggestion
         .version("3.4.0")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #36545: [WIP][SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36545:
URL: https://github.com/apache/spark/pull/36545#issuecomment-1126831255

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] physinet commented on a diff in pull request #36545: [SPARK-39168][PYTHON] Use all values in a python list when inferring ArrayType schema

Posted by GitBox <gi...@apache.org>.
physinet commented on code in PR #36545:
URL: https://github.com/apache/spark/pull/36545#discussion_r875864695


##########
python/pyspark/sql/session.py:
##########
@@ -570,10 +570,20 @@ def _inferSchemaFromList(
         if not data:
             raise ValueError("can not infer schema from empty dataset")
         infer_dict_as_struct = self._jconf.inferDictAsStruct()
+        infer_array_from_first_element = self._jconf.legacyInferArrayTypeFromFirstElement()

Review Comment:
   By the way, I think `StringType` is the only type that supports coercion currently: https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1596-L1599



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org