Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/16 21:39:09 UTC

[GitHub] [spark] xinrong-databricks opened a new pull request, #36893: Support `createDataFrame` from a list of scalars

xinrong-databricks opened a new pull request, #36893:
URL: https://github.com/apache/spark/pull/36893

   
   
   ### What changes were proposed in this pull request?
   Support `createDataFrame` from a list of scalars.
   
   
   ### Why are the changes needed?
   For API usability: `createDataFrame` previously could not infer a schema from a list of scalars such as `[1, 2]`, so users had to wrap each value in a tuple themselves.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   1. `createDataFrame` from a list of scalars is supported.
   ```py
   >>> spark.createDataFrame([1, 2]).collect()
   [Row(_1=1), Row(_1=2)]       
   ```
   2. Consolidated error messages
   ```py
   >>> spark.createDataFrame([1, 'x'])
   Traceback (most recent call last):
   ...
   TypeError: field _1: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
   
   >>> spark.createDataFrame([1, (2,)])
   Traceback (most recent call last):
   ...
   TypeError: field _1: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StructType'>
   ```
   
   
   ### How was this patch tested?
   Unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org




[GitHub] [spark] zhengruifeng commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r900708474


##########
python/pyspark/sql/session.py:
##########
@@ -1024,7 +1024,14 @@ def prepare(obj: Any) -> Any:
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
         else:

Review Comment:
   ```suggestion
           elif isinstance(data, list):
               # Wrap each element with a tuple if there is any scalar in the list
               has_scalar = any(isinstance(x, str) or not hasattr(x, "__len__") for x in data)
               converted_data = [(x,) for x in data] if has_scalar else data
               rdd, struct = self._createFromLocal(map(prepare, converted_data), schema)
           else:
               rdd, struct = self._createFromLocal(map(prepare, data), schema)
   ```
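The wrapping step in this suggestion can be exercised on its own, without a SparkSession; `wrap_scalars` below is a hypothetical standalone helper that mirrors the suggested logic:

```py
def wrap_scalars(data):
    # If any element looks like a scalar (str counts as one, despite
    # having a length), wrap every element in a 1-tuple so each row
    # has a uniform record shape.
    has_scalar = any(isinstance(x, str) or not hasattr(x, "__len__") for x in data)
    return [(x,) for x in data] if has_scalar else data

print(wrap_scalars([1, 2]))        # [(1,), (2,)]
print(wrap_scalars([(1,), (2,)]))  # [(1,), (2,)] (already tuples, untouched)
```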





[GitHub] [spark] github-actions[bot] commented on pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #36893:
URL: https://github.com/apache/spark/pull/36893#issuecomment-1286306448

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r918165714


##########
python/pyspark/sql/tests/test_types.py:
##########
@@ -374,12 +373,6 @@ def test_negative_decimal(self):
         finally:
             self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal=false")
 
-    def test_create_dataframe_from_objects(self):

Review Comment:
   Moved the test to python/pyspark/sql/tests/test_session.py.





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r918171922


##########
python/pyspark/sql/session.py:
##########
@@ -1023,6 +1023,20 @@ def prepare(obj: Any) -> Any:
 
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+        elif isinstance(data, list) and schema is None:
+            # Wrap each element with a tuple if there is any scalar in the list
+
+            has_scalar = any(
+                isinstance(x, str)
+                or (
+                    not isinstance(x, Sized)
+                    and (type(x).__module__ == object.__module__)  # built-in

Review Comment:
   Accepts native Python scalars only.
   
   Supporting numpy/pandas scalars will be implemented in https://issues.apache.org/jira/browse/SPARK-39745.
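A standalone sketch of the check under discussion (the helper name `is_native_scalar` is hypothetical). A numpy or pandas scalar fails the `__module__` test, which is why only native Python scalars are accepted:

```py
from collections.abc import Sized

def is_native_scalar(x):
    # str is Sized but is still treated as a scalar; otherwise an object
    # counts as a scalar only if it has no length and its type lives in
    # the `builtins` module (object.__module__), i.e. it is a built-in.
    return isinstance(x, str) or (
        not isinstance(x, Sized) and type(x).__module__ == object.__module__
    )

print(is_native_scalar(1))     # True  (built-in int)
print(is_native_scalar("x"))   # True  (str is special-cased)
print(is_native_scalar((1,)))  # False (tuple is Sized)
```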





[GitHub] [spark] ueshin commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
ueshin commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r899648682


##########
python/pyspark/sql/session.py:
##########
@@ -1024,7 +1024,18 @@ def prepare(obj: Any) -> Any:
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
         else:
-            rdd, struct = self._createFromLocal(map(prepare, data), schema)
+            if isinstance(data, list):
+                # Wrap each element with a tuple if there is any scalar in the list
+                has_scalar = False
+                for x in data:
+                    if isinstance(x, str) or not hasattr(x, "__len__"):  # Scalar
+                        has_scalar = True
+                        break

Review Comment:
   How about:
   ```py
   has_scalar = any(isinstance(x, str) or not hasattr(x, "__len__") for x in data)
   ```
   ?
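Both forms compute the same predicate; a quick standalone check (the sample data is illustrative):

```py
data = [{"a": 1}, "x", 3]

# Original loop form
has_scalar_loop = False
for x in data:
    if isinstance(x, str) or not hasattr(x, "__len__"):  # Scalar
        has_scalar_loop = True
        break

# Suggested any() form
has_scalar_any = any(isinstance(x, str) or not hasattr(x, "__len__") for x in data)

print(has_scalar_loop, has_scalar_any)  # True True
```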





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r903018256


##########
python/pyspark/sql/session.py:
##########
@@ -1024,7 +1024,14 @@ def prepare(obj: Any) -> Any:
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
         else:

Review Comment:
   Thanks!





[GitHub] [spark] github-actions[bot] closed pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided
URL: https://github.com/apache/spark/pull/36893




[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r903033580


##########
python/pyspark/sql/session.py:
##########
@@ -1024,7 +1024,14 @@ def prepare(obj: Any) -> Any:
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
         else:
-            rdd, struct = self._createFromLocal(map(prepare, data), schema)
+            if isinstance(data, list):
+                # Wrap each element with a tuple if there is any scalar in the list
+                has_scalar = any(isinstance(x, str) or not hasattr(x, "__len__") for x in data)

Review Comment:
   Replaced, thanks!





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r901221465


##########
python/pyspark/sql/session.py:
##########
@@ -1024,7 +1024,14 @@ def prepare(obj: Any) -> Any:
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
         else:
-            rdd, struct = self._createFromLocal(map(prepare, data), schema)
+            if isinstance(data, list):
+                # Wrap each element with a tuple if there is any scalar in the list
+                has_scalar = any(isinstance(x, str) or not hasattr(x, "__len__") for x in data)

Review Comment:
   `hasattr(x, "__len__")` could be replaced with `isinstance(x, collections.abc.Sized)`
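For reference, `collections.abc.Sized` defines a subclass hook on `__len__`, so the two checks agree for ordinary classes (an instance-level `__len__` is a corner case where they can differ):

```py
from collections.abc import Sized

class HasLen:
    def __len__(self) -> int:
        return 0

# The checks agree on common cases: scalars lack __len__, containers have it.
for obj in (5, 1.5, None, "s", [1], (1,), {"a": 1}, HasLen()):
    assert isinstance(obj, Sized) == hasattr(obj, "__len__")
print("checks agree")
```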







[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r899675695


##########
python/pyspark/sql/session.py:
##########
@@ -1024,7 +1024,18 @@ def prepare(obj: Any) -> Any:
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
         else:
-            rdd, struct = self._createFromLocal(map(prepare, data), schema)
+            if isinstance(data, list):
+                # Wrap each element with a tuple if there is any scalar in the list
+                has_scalar = False
+                for x in data:
+                    if isinstance(x, str) or not hasattr(x, "__len__"):  # Scalar
+                        has_scalar = True
+                        break

Review Comment:
   Looks much cleaner! I guess Python `any` should return immediately when it detects the **first** `True`.
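The guess is correct: `any` consumes the generator lazily and stops at the first truthy value, which a quick trace confirms:

```py
def scalar_checks(data, seen):
    # Record each element as it is examined, then yield the scalar test.
    for x in data:
        seen.append(x)
        yield isinstance(x, str) or not hasattr(x, "__len__")

seen = []
has_scalar = any(scalar_checks(["a", (1,), 2], seen))
print(has_scalar)  # True
print(seen)        # ['a'] -- any() stopped after the first True
```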





[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r917160564


##########
python/pyspark/sql/session.py:
##########
@@ -1023,6 +1023,12 @@ def prepare(obj: Any) -> Any:
 
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+        elif isinstance(data, list) and schema is None:

Review Comment:
   Respect the existing behavior when `schema` is given.
   
   Specifically, 
   ```py
   >>> spark.createDataFrame([1, 2], schema='int')
   DataFrame[value: int]
   
   >>> spark.createDataFrame([(1,), (2,)], schema='int')
   ...
   TypeError: field value: IntegerType() can not accept object (1,) in type <class 'tuple'>
   ```
   
   Adjusting the inconsistency above is out of the scope of this PR and ought to be a follow-up.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r919529370


##########
python/pyspark/sql/session.py:
##########
@@ -1023,6 +1023,20 @@ def prepare(obj: Any) -> Any:
 
         if isinstance(data, RDD):
             rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio)
+        elif isinstance(data, list) and schema is None:
+            # Wrap each element with a tuple if there is any scalar in the list
+
+            has_scalar = any(
+                isinstance(x, str)
+                or (
+                    not isinstance(x, Sized)

Review Comment:
   How does this work in case of nested types? e.g., `spark.createDataFrame([(1,), (2,)], schema='struct<a: int>')`. We should probably check how the Scala side works, and match the behaviour.


