You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by YuvalItzchakov <gi...@git.apache.org> on 2017/08/26 17:11:33 UTC

[GitHub] spark pull request #19059: [SS] - Avoid using `return` inside `CachedKafkaCo...

GitHub user YuvalItzchakov opened a pull request:

    https://github.com/apache/spark/pull/19059

    [SS] - Avoid using `return` inside `CachedKafkaConsumer.get`

    During profiling of a structured streaming application with Kafka as the source, I came across this exception:
    
    ![Structured Streaming Kafka Exceptions](https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png)
    
    This is a 1 minute sample, which caused 106K `NonLocalReturnControl` exceptions to be thrown. 
    This happens because `CachedKafkaConsumer.get` is ran inside:
    
    `private def runUninterruptiblyIfPossible[T](body: => T): T`
    
    Where `body: => T` is the `get` method. Turning the method into a function means that in order to escape the `while` loop defined in `get` the runtime has to do dirty tricks which involve throwing the above exception.
    
    ## What changes were proposed in this pull request?
    
    Instead of using `return` (which is generally not recommended in Scala), we place the result of the `fetchData` method inside a local variable and use a boolean flag to indicate the status of fetching data, which we monitor as our predicate to the `while` loop.
    
    ## How was this patch tested?
    
    I've ran the `KafkaSourceSuite` to make sure regression passes. Since the exception isn't visible from user code, there is no way (at least that I could think of) to add this as a test to the existing suite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YuvalItzchakov/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19059.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19059
    
----
commit c20bd14a4bed34644efc11de420a1caeccea329e
Author: Yuval Itzchakov <yu...@clicktale.com>
Date:   2017-08-26T15:21:17Z

    Avoid using "return" inside `CachedKafkaConsumer.get` as it is passed to `org.apache.spark.util.UninterruptibleThread.runUninterruptibly` as a function type which causes a NonLocalReturnControl to be called for every call

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SPARK-21873][SS] - Avoid using `return` inside `CachedK...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SS] - Avoid using `return` inside `CachedKafkaConsumer....

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    **[Test build #3910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3910/testReport)** for PR 19059 at commit [`18b9301`](https://github.com/apache/spark/commit/18b9301553427a7b6c038e144f1be52949d82eb9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SS] - Avoid using `return` inside `CachedKafkaConsumer....

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    By the way @YuvalItzchakov  could you make a JIRA and link it? this is small but it's a non-trivial improvement and think we should handle it as a normal issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SS] - Avoid using `return` inside `CachedKafkaConsumer....

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    **[Test build #3907 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3907/testReport)** for PR 19059 at commit [`c20bd14`](https://github.com/apache/spark/commit/c20bd14a4bed34644efc11de420a1caeccea329e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19059: [SS] - Avoid using `return` inside `CachedKafkaCo...

Posted by YuvalItzchakov <gi...@git.apache.org>.
Github user YuvalItzchakov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19059#discussion_r135405801
  
    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
    @@ -125,8 +131,11 @@ private[kafka010] case class CachedKafkaConsumer private(
               toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset)
           }
         }
    -    resetFetchedData()
    -    null
    +
    +    if (isFetchComplete) consumerRecord else {
    --- End diff --
    
    Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19059: [SS] - Avoid using `return` inside `CachedKafkaCo...

Posted by YuvalItzchakov <gi...@git.apache.org>.
Github user YuvalItzchakov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19059#discussion_r135397155
  
    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
    @@ -112,9 +112,15 @@ private[kafka010] case class CachedKafkaConsumer private(
         // we will move to the next available offset within `[offset, untilOffset)` and retry.
         // If `failOnDataLoss` is `true`, the loop body will be executed only once.
         var toFetchOffset = offset
    -    while (toFetchOffset != UNKNOWN_OFFSET) {
    +    var consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]] = null
    +    // We want to break out of the while loop on a successful fetch to avoid using "return"
    +    // which may causes a NonLocalReturnControl exception when this method is used as a function.
    +    var isFetchComplete = false
    +
    +    while (toFetchOffset != UNKNOWN_OFFSET && !isFetchComplete) {
           try {
    -        return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss)
    +        consumerRecord = fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss)
    --- End diff --
    
    It may return null: https://github.com/apache/spark/blob/823baca2cb8edb62885af547d3511c9e8923cefd/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L234


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SPARK-21873][SS] - Avoid using `return` inside `CachedK...

Posted by YuvalItzchakov <gi...@git.apache.org>.
Github user YuvalItzchakov commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    https://issues.apache.org/jira/browse/SPARK-21873


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SS] - Avoid using `return` inside `CachedKafkaConsumer....

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    **[Test build #3910 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3910/testReport)** for PR 19059 at commit [`18b9301`](https://github.com/apache/spark/commit/18b9301553427a7b6c038e144f1be52949d82eb9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19059: [SS] - Avoid using `return` inside `CachedKafkaCo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19059#discussion_r135397054
  
    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
    @@ -112,9 +112,15 @@ private[kafka010] case class CachedKafkaConsumer private(
         // we will move to the next available offset within `[offset, untilOffset)` and retry.
         // If `failOnDataLoss` is `true`, the loop body will be executed only once.
         var toFetchOffset = offset
    -    while (toFetchOffset != UNKNOWN_OFFSET) {
    +    var consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]] = null
    +    // We want to break out of the while loop on a successful fetch to avoid using "return"
    +    // which may causes a NonLocalReturnControl exception when this method is used as a function.
    +    var isFetchComplete = false
    +
    +    while (toFetchOffset != UNKNOWN_OFFSET && !isFetchComplete) {
           try {
    -        return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss)
    +        consumerRecord = fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss)
    --- End diff --
    
    Can `fetchData` return null? if not, then the condition can just be on `consumerRecord == null`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SS] - Avoid using `return` inside `CachedKafkaConsumer....

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    **[Test build #3907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3907/testReport)** for PR 19059 at commit [`c20bd14`](https://github.com/apache/spark/commit/c20bd14a4bed34644efc11de420a1caeccea329e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19059: [SS] - Avoid using `return` inside `CachedKafkaCo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19059#discussion_r135397047
  
    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
    @@ -125,8 +131,11 @@ private[kafka010] case class CachedKafkaConsumer private(
               toFetchOffset = getEarliestAvailableOffsetBetween(toFetchOffset, untilOffset)
           }
         }
    -    resetFetchedData()
    -    null
    +
    +    if (isFetchComplete) consumerRecord else {
    --- End diff --
    
    Go ahead and put the if clause on a new line with braces


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19059: [SS] - Avoid using `return` inside `CachedKafkaCo...

Posted by YuvalItzchakov <gi...@git.apache.org>.
Github user YuvalItzchakov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19059#discussion_r135397171
  
    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
    @@ -112,9 +112,15 @@ private[kafka010] case class CachedKafkaConsumer private(
         // we will move to the next available offset within `[offset, untilOffset)` and retry.
         // If `failOnDataLoss` is `true`, the loop body will be executed only once.
         var toFetchOffset = offset
    -    while (toFetchOffset != UNKNOWN_OFFSET) {
    +    var consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]] = null
    +    // We want to break out of the while loop on a successful fetch to avoid using "return"
    +    // which may causes a NonLocalReturnControl exception when this method is used as a function.
    +    var isFetchComplete = false
    +
    +    while (toFetchOffset != UNKNOWN_OFFSET && !isFetchComplete) {
           try {
    -        return fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss)
    +        consumerRecord = fetchData(toFetchOffset, untilOffset, pollTimeoutMs, failOnDataLoss)
    --- End diff --
    
    Actually, this is an additional case that needs to be taken into account. Ill look into that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19059: [SS] - Avoid using `return` inside `CachedKafkaConsumer....

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19059
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19059: [SPARK-21873][SS] - Avoid using `return` inside `...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19059


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org