You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2023/03/07 12:22:13 UTC

[GitHub] [spark] panbingkun opened a new pull request, #40316: [WIP][SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

panbingkun opened a new pull request, #40316:
URL: https://github.com/apache/spark/pull/40316

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on pull request #40316: [WIP][SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1458241313

   <img width="797" alt="image" src="https://user-images.githubusercontent.com/15246973/223446693-3c296b56-f9aa-4b70-9eb3-5bc9059ba631.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1128906598


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -2876,6 +2876,13 @@ def test_unsupported_io_functions(self):
             with self.assertRaises(NotImplementedError):
                 getattr(df.write, f)()
 
+    def test_inferred_schema(self):

Review Comment:
   Can we enrich the test by also including the test for schema_true and pandas DataFrame??
   
   such as:
   ```
   schema_true = StructType([StructField("id", IntegerType(), True)])
   cdf2 = self.connect.createDataFrame([[1]], schema=schema_true)
   sdf2 = self.spark.createDataFrame([[1]], schema=schema_true)
   self.assertEqual(cdf2.schema, sdf2.schema)
   self.assertEqual(cdf2.collect(), sdf2.collect())
   
   pdf1 = cdf1.toPandas()
   cdf3 = self.connect.createDataFrame(pdf1, cdf1.schema)
   sdf3 = self.spark.createDataFrame(pdf1, cdf1.schema)
   self.assertEqual(cdf3.schema, sdf3.schema)
   self.assertEqual(cdf3.collect(), sdf3.collect())
   
   pdf2 = cdf2.toPandas()
   cdf4 = self.connect.createDataFrame(pdf2, cdf2.schema)
   sdf4 = self.spark.createDataFrame(pdf2, cdf2.schema)
   self.assertEqual(cdf4.schema, sdf4.schema)
   self.assertEqual(cdf4.collect(), sdf4.collect())
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on a diff in pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1129113211


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -2876,6 +2876,13 @@ def test_unsupported_io_functions(self):
             with self.assertRaises(NotImplementedError):
                 getattr(df.write, f)()
 
+    def test_inferred_schema(self):

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1465005658

   I submitted the PR #40382.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1129370781


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -2876,6 +2876,13 @@ def test_unsupported_io_functions(self):
             with self.assertRaises(NotImplementedError):
                 getattr(df.write, f)()
 
+    def test_inferred_schema(self):

Review Comment:
   Awesome!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1464812653

   > Hi @panbingkun, Good catch and thanks for working on this! But I'm afraid I don't think this is a good direction. It won't fix the other cases. Let me take a look at this issue and submit a PR later.
   
   Okay.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1130239606


##########
python/pyspark/sql/types.py:
##########
@@ -1599,6 +1599,18 @@ def _has_nulltype(dt: DataType) -> bool:
         return isinstance(dt, NullType)
 
 
+def _has_non_nullable(dt: Optional[Union[AtomicType, StructType]]) -> bool:

Review Comment:
   How about just simply:
   
   ```python
   def _has_non_nullable(dt: DataType) -> bool:
   ```
   
   I believe it would fix the linter



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1464790036

   Hi @panbingkun, Good catch and thanks for working on this!
   But I'm afraid I don't think this is a good direction. It won't fix the other cases.
   Let me take a look at this issue and submit a PR later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun closed pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun closed pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema
URL: https://github.com/apache/spark/pull/40316


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1459721237

   Root causeļ¼š
   The data type nullable inferred from the data is true by default
   However, on the connector server side, `to(schema: StructType)` is called, and then `Project.matchSchema (logicalPlan, schema, sparkSession. sessionState. conf)` is executed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40316:
URL: https://github.com/apache/spark/pull/40316#issuecomment-1459182974

   cc @itholic 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on a diff in pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1130275274


##########
python/pyspark/sql/types.py:
##########
@@ -1599,6 +1599,18 @@ def _has_nulltype(dt: DataType) -> bool:
         return isinstance(dt, NullType)
 
 
+def _has_non_nullable(dt: Optional[Union[AtomicType, StructType]]) -> bool:

Review Comment:
   OK, let me fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1128906598


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -2876,6 +2876,13 @@ def test_unsupported_io_functions(self):
             with self.assertRaises(NotImplementedError):
                 getattr(df.write, f)()
 
+    def test_inferred_schema(self):

Review Comment:
   Can we enrich the test by also including the test for schema_true and pandas DataFrame??
   
   such as:
   ```python
   schema_true = StructType([StructField("id", IntegerType(), True)])
   cdf2 = self.connect.createDataFrame([[1]], schema=schema_true)
   sdf2 = self.spark.createDataFrame([[1]], schema=schema_true)
   self.assertEqual(cdf2.schema, sdf2.schema)
   self.assertEqual(cdf2.collect(), sdf2.collect())
   
   pdf1 = cdf1.toPandas()
   cdf3 = self.connect.createDataFrame(pdf1, cdf1.schema)
   sdf3 = self.spark.createDataFrame(pdf1, cdf1.schema)
   self.assertEqual(cdf3.schema, sdf3.schema)
   self.assertEqual(cdf3.collect(), sdf3.collect())
   
   pdf2 = cdf2.toPandas()
   cdf4 = self.connect.createDataFrame(pdf2, cdf2.schema)
   sdf4 = self.spark.createDataFrame(pdf2, cdf2.schema)
   self.assertEqual(cdf4.schema, sdf4.schema)
   self.assertEqual(cdf4.collect(), sdf4.collect())
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org