Posted to reviews@spark.apache.org by "ueshin (via GitHub)" <gi...@apache.org> on 2023/03/11 20:16:07 UTC

[GitHub] [spark] ueshin opened a new pull request, #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

ueshin opened a new pull request, #40382:
URL: https://github.com/apache/spark/pull/40382

   ### What changes were proposed in this pull request?
   
   Fixes `spark.createDataFrame` to apply the given schema so that it works with non-nullable data types.
   
   ### Why are the changes needed?
   
   Currently, `spark.createDataFrame` doesn't work with a non-nullable schema, as shown below:
   
   ```py
   >>> from pyspark.sql.types import *
   >>> schema_false = StructType([StructField("id", IntegerType(), False)])
   >>> spark.createDataFrame([[1]], schema=schema_false)
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.
   ```
   
   whereas it works fine with a nullable schema:
   
   ```py
   >>> from pyspark.sql.types import *
   >>> schema_true = StructType([StructField("id", IntegerType(), True)])
   >>> spark.createDataFrame([[1]], schema=schema_true)
   DataFrame[id: int]
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   `spark.createDataFrame` with a non-nullable schema will now work.
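   
   With the fix, the failing non-nullable example above is expected to behave like the nullable one (an illustrative session, not output copied from the PR):
   
   ```py
   >>> from pyspark.sql.types import *
   >>> schema_false = StructType([StructField("id", IntegerType(), False)])
   >>> spark.createDataFrame([[1]], schema=schema_false)
   DataFrame[id: int]
   ```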
   
   ### How was this patch tested?
   
   Added related tests.
   




[GitHub] [spark] ueshin commented on a diff in pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40382:
URL: https://github.com/apache/spark/pull/40382#discussion_r1133181232


##########
python/pyspark/sql/connect/session.py:
##########
@@ -267,6 +269,10 @@ def createDataFrame(
                 [ser._create_batch([(c, t) for (_, c), t in zip(data.items(), arrow_types)])]
             )
 
+            if isinstance(schema, StructType):
+                assert arrow_schema is not None

Review Comment:
   It just makes `mypy` happy. 😄 
   
   btw, `to_arrow_schema` returns `pa.Schema`.
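   
   (For anyone skimming the thread: a standalone toy of the narrowing pattern the `assert` relies on. The names below are made up; only the `Optional` handling carries over.)
   
   ```py
   from typing import Optional
   
   def use_value(have_value: bool) -> int:
       value: Optional[int] = None
       if have_value:
           value = 42
       if have_value:
           # mypy still sees Optional[int] here; the assert narrows it to int.
           assert value is not None
           return value + 1
       return 0
   ```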





[GitHub] [spark] itholic commented on a diff in pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40382:
URL: https://github.com/apache/spark/pull/40382#discussion_r1133177990


##########
python/pyspark/sql/connect/session.py:
##########
@@ -267,6 +269,10 @@ def createDataFrame(
                 [ser._create_batch([(c, t) for (_, c), t in zip(data.items(), arrow_types)])]
             )
 
+            if isinstance(schema, StructType):
+                assert arrow_schema is not None

Review Comment:
   Maybe a dumb question, but just out of curiosity: is there any chance that we get `None` for `arrow_schema`?
   
   Seems like the `to_arrow_schema` function always returns `pyarrow.Field` since it uses `pa.field` to create the schema?





[GitHub] [spark] zhengruifeng commented on pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40382:
URL: https://github.com/apache/spark/pull/40382#issuecomment-1465447635

   merged to master/branch-3.4




[GitHub] [spark] itholic commented on a diff in pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40382:
URL: https://github.com/apache/spark/pull/40382#discussion_r1133183226


##########
python/pyspark/sql/connect/session.py:
##########
@@ -267,6 +269,10 @@ def createDataFrame(
                 [ser._create_batch([(c, t) for (_, c), t in zip(data.items(), arrow_types)])]
             )
 
+            if isinstance(schema, StructType):
+                assert arrow_schema is not None

Review Comment:
   Oh, yeah, I actually meant that `pa.field` returns `pyarrow.Field`. Sorry for the confusion.





[GitHub] [spark] itholic commented on a diff in pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40382:
URL: https://github.com/apache/spark/pull/40382#discussion_r1133177990


##########
python/pyspark/sql/connect/types.py:
##########
@@ -309,9 +309,12 @@ def to_arrow_type(dt: DataType) -> "pa.DataType":
     elif type(dt) == DayTimeIntervalType:
         arrow_type = pa.duration("us")
     elif type(dt) == ArrayType:
-        arrow_type = pa.list_(to_arrow_type(dt.elementType))
+        field = pa.field("element", to_arrow_type(dt.elementType), nullable=dt.containsNull)

Review Comment:
   Oh... so we didn't pass the nullable information to `pa.list_`....
   
   Thanks for fixing this!!
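   
   (For reference, a minimal PyArrow sketch of the difference; the `element` field name mirrors the diff above, and the printed reprs are from a recent PyArrow release, so they may vary slightly.)
   
   ```py
   >>> import pyarrow as pa
   >>> # a bare element type defaults to a nullable "item" field
   >>> pa.list_(pa.int32())
   ListType(list<item: int32>)
   >>> # wrapping the element in pa.field carries containsNull=False through
   >>> pa.list_(pa.field("element", pa.int32(), nullable=False))
   ListType(list<element: int32 not null>)
   ```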





[GitHub] [spark] zhengruifeng closed pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #40382: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema
URL: https://github.com/apache/spark/pull/40382

