Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2023/03/05 02:29:25 UTC

[GitHub] [spark] panbingkun opened a new pull request, #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

panbingkun opened a new pull request, #40280:
URL: https://github.com/apache/spark/pull/40280

   ### What changes were proposed in this pull request?
   This PR fixes a bug in `createDataFrame` when the schema contains complex (nested) types.
   
   ### Why are the changes needed?
   While writing a UT for `DataFrameNaFunctions`, I found that the following code does not work:
   ```scala
   import scala.collection.JavaConverters._

   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types.{StringType, StructType}

   val schema = new StructType()
     .add("c1", new StructType()
       .add("c1-1", StringType)
       .add("c1-2", StringType))
   val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
   val df = spark.createDataFrame(data.asJava, schema)
   df.show()
   ```

   The error message is as follows:
   <img width="657" alt="image" src="https://user-images.githubusercontent.com/15246973/222938339-77dec8c6-549b-41de-869b-8a191a0f745e.png">
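
   As the full stack trace later in the thread shows, the server fails to parse the `catalogString` form of the schema: the compact `struct<name:type>` notation does not quote field names, so a name like `c1-1` cannot be parsed back, while the JSON form quotes every name and also records nullability. A minimal sketch of the difference, using only `pyspark.sql.types` (pure Python, no SparkSession required; `simpleString()` is the Python analogue of Scala's `catalogString` for this schema):

   ```python
   import json

   from pyspark.sql.types import StringType, StructField, StructType

   inner = StructType([
       StructField("c1-1", StringType()),
       StructField("c1-2", StringType()),
   ])
   schema = StructType([StructField("c1", inner)])

   # Compact form: field names are unquoted, so `c1-1` is ambiguous to a
   # DDL-style parser. This is exactly the string in the error message.
   print(schema.simpleString())  # struct<c1:struct<c1-1:string,c1-2:string>>

   # JSON form: names are quoted and nullability is kept, so the schema
   # can be reconstructed exactly on the receiving side.
   roundtrip = StructType.fromJson(json.loads(schema.json()))
   assert roundtrip == schema
   ```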
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added a new UT.
   Passed GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] panbingkun commented on a diff in pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #40280:
URL: https://github.com/apache/spark/pull/40280#discussion_r1125590273


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##########
@@ -115,7 +115,7 @@ class SparkSession private[sql] (
   private def createDataset[T](encoder: AgnosticEncoder[T], data: Iterator[T]): Dataset[T] = {
     newDataset(encoder) { builder =>
       val localRelationBuilder = builder.getLocalRelationBuilder
-        .setSchema(encoder.schema.catalogString)

Review Comment:
   Should I change `json` to `toDDL`, or is `json` actually OK?





[GitHub] [spark] itholic commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1458111230

   Awesome!! Let me take a look at your PR.
   Thanks!




[GitHub] [spark] panbingkun commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1454964971

   ### Full stack trace:
   
   INTERNAL: 
   [PARSE_SYNTAX_ERROR] Syntax error at or near '<': extra input '<'.(line 1, pos 6)
   
   == SQL ==
   struct<c1:struct<c1-1:string,c1-2:string>>
   ------^^^
   
   io.grpc.StatusRuntimeException: INTERNAL: 
   [PARSE_SYNTAX_ERROR] Syntax error at or near '<': extra input '<'.(line 1, pos 6)
   
   == SQL ==
   struct<c1:struct<c1-1:string,c1-2:string>>
   ------^^^
   
   	at io.grpc.Status.asRuntimeException(Status.java:535)
   	at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
   	at org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
   	at org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
   	at org.apache.spark.sql.Dataset.$anonfun$show$2(Dataset.scala:529)
   	at org.apache.spark.sql.Dataset.$anonfun$show$2$adapted(Dataset.scala:528)
   	at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2752)
   	at org.apache.spark.sql.Dataset.show(Dataset.scala:528)
   	at org.apache.spark.sql.Dataset.show(Dataset.scala:444)
   	at org.apache.spark.sql.Dataset.show(Dataset.scala:399)
   	at org.apache.spark.sql.Dataset.show(Dataset.scala:408)
   	at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$85(ClientE2ETestSuite.scala:608)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
   	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
   	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   	at org.scalatest.Transformer.apply(Transformer.scala:22)
   	at org.scalatest.Transformer.apply(Transformer.scala:20)
   	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
   	at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
   	at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
   	at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
   	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
   	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
   	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
   	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
   	at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
   	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
   	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
   	at scala.collection.immutable.List.foreach(List.scala:431)
   	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
   	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
   	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
   	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
   	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
   	at org.scalatest.Suite.run(Suite.scala:1114)
   	at org.scalatest.Suite.run$(Suite.scala:1096)
   	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
   	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
   	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
   	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
   	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
   	at org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:34)
   	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
   	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
   	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
   	at org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:34)
   	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
   	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
   	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
   	at scala.collection.immutable.List.foreach(List.scala:431)
   	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
   	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
   	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
   	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
   	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
   	at org.scalatest.tools.Runner$.run(Runner.scala:798)
   	at org.scalatest.tools.Runner.run(Runner.scala)
   	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
   	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)
   




[GitHub] [spark] itholic commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1457558427

   Thanks, @panbingkun !
   By the way, I think this issue has a pretty high priority since the default nullability of a schema is `False`.
   
   ```python
   >>> sdf = spark.range(10).schema
   self._schema: StructType([StructField('id', LongType(), False)])
   ```
   
   For example, even intuitive and simple code like creating a DataFrame from a pandas DataFrame fails as follows:
   ```python
   >>> sdf = spark.range(10)
   >>> pdf = sdf.toPandas()
   >>> spark.createDataFrame(pdf, sdf.schema)
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.
   ```
   
   Feel free to ping me anytime if you need any help!
   Thanks again for your time on investigating this :-)
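
   As a side note on why `json` is the safer wire format here: the compact `struct<...>` string cannot express per-field nullability at all, while the JSON form records it explicitly. A small illustration using only `pyspark.sql.types` (pure Python, no SparkSession required):

   ```python
   import json

   from pyspark.sql.types import LongType, StructField, StructType

   schema = StructType([StructField("id", LongType(), nullable=False)])

   # The compact form drops the nullable flag: non-nullable and nullable
   # `id` columns serialize to the very same string.
   print(schema.simpleString())  # struct<id:bigint>

   # The JSON form carries the flag for every field.
   parsed = json.loads(schema.json())
   assert parsed["fields"][0]["nullable"] is False
   ```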





[GitHub] [spark] itholic commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1455567210

   Thanks @panbingkun for the nice fix!
   Btw, I think I found another `createDataFrame` bug: it does not work properly with a non-nullable schema, as below:
   ```python
   >>> from pyspark.sql.types import *
   >>> schema_false = StructType([StructField("id", IntegerType(), False)])
   >>> spark.createDataFrame([[1]], schema=schema_false)
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.
   ```
   whereas it works fine with a nullable schema, as below:
   ```python
   >>> schema_true = StructType([StructField("id", IntegerType(), True)])
   >>> spark.createDataFrame([[1]], schema=schema_true)
   DataFrame[id: int]
   ```
   
   Do you have any idea what might be causing this? Could you take a look if you're interested? I have filed an issue at SPARK-42679.
   
   Also cc @hvanhovell as an original author for `createDataFrame`.




[GitHub] [spark] hvanhovell commented on a diff in pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40280:
URL: https://github.com/apache/spark/pull/40280#discussion_r1125800378


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##########
@@ -115,7 +115,7 @@ class SparkSession private[sql] (
   private def createDataset[T](encoder: AgnosticEncoder[T], data: Iterator[T]): Dataset[T] = {
     newDataset(encoder) { builder =>
       val localRelationBuilder = builder.getLocalRelationBuilder
-        .setSchema(encoder.schema.catalogString)

Review Comment:
   json is ok





[GitHub] [spark] panbingkun commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1458081353

   > Thanks, @panbingkun ! By the way, I think this issue has a pretty high priority since the default nullability of a schema is `False`.
   > 
   > ```python
   > >>> sdf = spark.range(10).schema
   > self._schema: StructType([StructField('id', LongType(), False)])
   > ```
   > 
   > For example, even intuitive and simple code like creating a DataFrame from a pandas DataFrame fails as follows:
   > 
   > ```python
   > >>> sdf = spark.range(10)
   > >>> pdf = sdf.toPandas()
   > >>> spark.createDataFrame(pdf, sdf.schema)
   > Traceback (most recent call last):
   > ...
   > pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.
   > ```
   > 
   > Please feel free to ping me anytime if you need any help! Thanks again for your time on investigating this :-)
   
   I have found the root cause. Let me think about how to fix it.
   Temporary solution: https://github.com/apache/spark/pull/40316




[GitHub] [spark] panbingkun commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1457349284

   > Thanks @panbingkun for the nice fix! Btw, I think I found another `createDataFrame` bug: it does not work properly with a non-nullable schema, as below:
   > 
   > ```python
   > >>> from pyspark.sql.types import *
   > >>> schema_false = StructType([StructField("id", IntegerType(), False)])
   > >>> spark.createDataFrame([[1]], schema=schema_false)
   > Traceback (most recent call last):
   > ...
   > pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.
   > ```
   > 
   > whereas it works fine with a nullable schema, as below:
   > 
   > ```python
   > >>> schema_true = StructType([StructField("id", IntegerType(), True)])
   > >>> spark.createDataFrame([[1]], schema=schema_true)
   > DataFrame[id: int]
   > ```
   > 
   > Do you have any idea what might be causing this? Could you take a look if you're interested? I have filed an issue at [SPARK-42679](https://issues.apache.org/jira/browse/SPARK-42679).
   > 
   > Also cc @hvanhovell as an original author for `createDataFrame`.
   
   Let me try to investigate it.




[GitHub] [spark] hvanhovell commented on a diff in pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40280:
URL: https://github.com/apache/spark/pull/40280#discussion_r1125587820


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##########
@@ -115,7 +115,7 @@ class SparkSession private[sql] (
   private def createDataset[T](encoder: AgnosticEncoder[T], data: Iterator[T]): Dataset[T] = {
     newDataset(encoder) { builder =>
       val localRelationBuilder = builder.getLocalRelationBuilder
-        .setSchema(encoder.schema.catalogString)

Review Comment:
   Oh damn. This should have been `toDDL`. This is also wrong in `PlanGenerationTestSuite`. This is also why I don't like working with strings. See: https://github.com/apache/spark/pull/40238





[GitHub] [spark] panbingkun commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1454967088

   cc @hvanhovell 




[GitHub] [spark] hvanhovell closed pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell closed pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema
URL: https://github.com/apache/spark/pull/40280

