You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/06 00:00:09 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request, #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

HeartSaVioR opened a new pull request, #38115:
URL: https://github.com/apache/spark/pull/38115

   ### What changes were proposed in this pull request?
   
   This PR fixes a bug which occurs NPE when the input schema of applyInPandasWithState has "non-nullable" column(s).
   This PR also leaves a code comment explaining the fix. Quoting:
   
   ```
     // See processTimedOutState: we create a row which contains the actual values for grouping key,
     // but all nulls for value side by intention. This technically changes the schema of input to
     // be "nullable", hence the schema information and the internal projection of row should take
     // this into consideration. Strictly saying, it's not applied to the part of grouping key, but
     // it doesn't hurt much to we apply the same for grouping key as well.
   ```
   
   ### Why are the changes needed?
   
   There's a bug which we didn't take the non-null columns into account. This PR fixes the bug.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New UT. The new test case failed with NPE without the fix, and succeeded with the fix.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #38115:
URL: https://github.com/apache/spark/pull/38115#issuecomment-1269392323

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #38115:
URL: https://github.com/apache/spark/pull/38115#issuecomment-1269390826

   https://github.com/HeartSaVioR/spark/actions/runs/3193741282
   
   ^^^ There was a flaky test suite from Kafka - I reran the build separately and confirmed all builds passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #38115:
URL: https://github.com/apache/spark/pull/38115#discussion_r988473105


##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala:
##########
@@ -738,4 +738,89 @@ class FlatMapGroupsInPandasWithStateSuite extends StateStoreMetricsTest {
       }
     }
   }
+
+  test("SPARK-40670: applyInPandasWithState - streaming having non-null columns") {
+    // scalastyle:off assume

Review Comment:
   Looks like we have all the "style off and on" in all test cases. Maybe a separate PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #38115:
URL: https://github.com/apache/spark/pull/38115#discussion_r988472481


##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala:
##########
@@ -738,4 +738,89 @@ class FlatMapGroupsInPandasWithStateSuite extends StateStoreMetricsTest {
       }
     }
   }
+
+  test("SPARK-40670: applyInPandasWithState - streaming having non-null columns") {
+    // scalastyle:off assume

Review Comment:
   lol My bad.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #38115:
URL: https://github.com/apache/spark/pull/38115#issuecomment-1269124399

   cc. @HyukjinKwon Please take a look. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38115:
URL: https://github.com/apache/spark/pull/38115#discussion_r988473258


##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala:
##########
@@ -738,4 +738,89 @@ class FlatMapGroupsInPandasWithStateSuite extends StateStoreMetricsTest {
       }
     }
   }
+
+  test("SPARK-40670: applyInPandasWithState - streaming having non-null columns") {
+    // scalastyle:off assume

Review Comment:
   sure!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)
URL: https://github.com/apache/spark/pull/38115


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #38115:
URL: https://github.com/apache/spark/pull/38115#discussion_r988463233


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasWithStateExec.scala:
##########
@@ -83,7 +83,17 @@ case class FlatMapGroupsInPandasWithStateExec(
   private val chainedFunc = Seq(ChainedPythonFunctions(Seq(pythonFunction)))
   private lazy val (dedupAttributes, argOffsets) = resolveArgOffsets(
     groupingAttributes ++ child.output, groupingAttributes)
-  private lazy val unsafeProj = UnsafeProjection.create(dedupAttributes, child.output)
+
+  // See processTimedOutState: we create a row which contains the actual values for grouping key,
+  // but all nulls for value side by intention. This technically changes the schema of input to
+  // be "nullable", hence the schema information and the internal projection of row should take
+  // this into consideration. Strictly saying, it's not applied to the part of grouping key, but
+  // it doesn't hurt much to we apply the same for grouping key as well.

Review Comment:
   ```suggestion
     // it doesn't hurt much even if we apply the same for grouping key as well.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38115: [SPARK-40670][SS][PYTHON] Fix NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38115:
URL: https://github.com/apache/spark/pull/38115#discussion_r988465046


##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala:
##########
@@ -738,4 +738,89 @@ class FlatMapGroupsInPandasWithStateSuite extends StateStoreMetricsTest {
       }
     }
   }
+
+  test("SPARK-40670: applyInPandasWithState - streaming having non-null columns") {
+    // scalastyle:off assume

Review Comment:
   Oh, let's remove this. We don't have this rule in Apache Spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org