You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jerryshao <gi...@git.apache.org> on 2015/11/10 14:30:05 UTC

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/9597

    [SPARK-11632][Streaming] Filter out empty partition in KafkaRDD

    Currently empty partitions or empty `KafkaRDD` will still submit and run tasks to remotely, this is unnecessary since no data is processed. This patch fix this by filtering out the empty partition, this could potentially alleviate the scheduling overhead, since now empty partition will not generate task, even stage could be skipped if current rdd is empty. Also this could make dynamic allocation effective (no tasks generated if there's no data injected).
    
    Please help to review @tdas @koeninger and @zsxwing , thanks a lot!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-11632

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9597
    
----
commit 57b2914adb3ad4929e8c81b0daac98cfd83f674c
Author: jerryshao <ss...@hortonworks.com>
Date:   2015-11-10T13:18:21Z

    Filter out empty partition in KafkaRDD

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156794866
  
    Hi @koeninger , how about this change? Still keeping the mapping relations, so offset range can be retrieved through partitionId, just filter out empty partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-155419904
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-155420328
  
    **[Test build #45524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45524/consoleFull)** for PR 9597 at commit [`57b2914`](https://github.com/apache/spark/commit/57b2914adb3ad4929e8c81b0daac98cfd83f674c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156806725
  
    **[Test build #45949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45949/consoleFull)** for PR 9597 at commit [`30e1578`](https://github.com/apache/spark/commit/30e1578dcb2d690a5d0d48b7e6f1a7463aedc158).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by koeninger <gi...@git.apache.org>.

Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156167582
  
    This is going to break the invariant that offset ranges are 1:1 with spark partitions, which will definitely break some people's jobs in a non-obvious manner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156803004
  
    **[Test build #45949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45949/consoleFull)** for PR 9597 at commit [`30e1578`](https://github.com/apache/spark/commit/30e1578dcb2d690a5d0d48b7e6f1a7463aedc158).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-155428515
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-155428344
  
    **[Test build #45524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45524/consoleFull)** for PR 9597 at commit [`57b2914`](https://github.com/apache/spark/commit/57b2914adb3ad4929e8c81b0daac98cfd83f674c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156291634
  
    OK, get it. How about this, still keeping the mapping relation from offset Range to rdd partition, but filter out empty partition, so `offset(i)` still map to `partitions(i)`, what do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-155419891
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by koeninger <gi...@git.apache.org>.

Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156457252
  
    Well, the implementation of HasOffsetRanges.offsetRanges in KafkaRDD is
    just the val offsetRanges provided on creation, so you're talking about a
    fair amount of change.  It's a little weird from my perspective to create
    an object with a given set of topicpartitions, then when you ask for them
    back, to get a different set.
    
    I'm also not 100% sure what practical problem this is solving.  Is this
    actually sufficient to get dynamic allocation on a stream working
    correctly?  Also, if someone has particular partitions that are empty
    frequently enough to actually get a resource benefit from dynamic
    allocation, it sounds like something is wrong with their partitioning
    scheme.
    
    In terms of user expectations, if I have a job currently that is e.g.
    reporting on the rate of messages per topicpartition, isn't it suddenly
    going to stop reporting anything in cases where it would previously report
    0?
    
    On Thu, Nov 12, 2015 at 7:46 PM, Saisai Shao <no...@github.com>
    wrote:
    
    > OK, get it. How about this, still keeping the mapping relation from offset
    > Range to rdd partition, but filter out empty partition, so offset(i)
    > still map to partitions(i), what do you think?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/9597#issuecomment-156291634>.
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156806763
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45949/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156602165
  
    I think I will not change `offsetRanges `, I will make a simple change so you can see if it is reasonable.
    
    Yes, some partitions are empty while others have data is abnormal in Kafka (wrong partitioning scheme). But what about different topics? Also what I'm actually saying is that when all the partitions are empty (no data injected at that time period), is it better not to issue tasks, to let dynamic allocation ramp down resources?
    
    Besides dynamic allocation, basically issuing an empty task (we already know it is empty) without doing anything, from my understanding it can be avoided, and have no impact to most of the user cases (you could always find out use cases that beyond normal expectation).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156282327
  
    @koeninger , can you please point out in which scenario people will use this 1:1 guarantee, how people use this 1:1 restriction?
    
    Actually I saw some users are complaining why stage is still running even without data injected，that makes their dynamic allocation fail to ramp down executors. That's the point why I submit this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-155428516
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45524/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by koeninger <gi...@git.apache.org>.

Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156822484
  
    Are you 100% sure that all uses of the partition array only use the index
    associated with the individual Partition, and not its position in the array?
    
    At the very least, you should have a test for an RDD with partially empty
    partitions and ensure that the indices for hasoffsetranges line up with the
    task context partition id.
    
    On Sun, Nov 15, 2015 at 3:48 AM, Saisai Shao <no...@github.com>
    wrote:
    
    > Hi @koeninger <https://github.com/koeninger> , how about this change?
    > Still keeping the mapping relations, so offset range can be retrieved
    > through partitionId, just filter out empty partition.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/9597#issuecomment-156794866>.
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156806762
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11632][Streaming] Filter out empty part...

Posted by koeninger <gi...@git.apache.org>.

Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/9597#issuecomment-156290949
  
    https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#hasoffsetranges
    
    The 1:1 correspondence is also mentioned in the spark docs Kafka
    integration guide
    On Nov 12, 2015 6:46 PM, "Saisai Shao" <no...@github.com> wrote:
    
    > @koeninger <https://github.com/koeninger> , can you please point out in
    > which scenario people will use this 1:1 guarantee, how people use this 1:1
    > restriction?
    >
    > Actually I saw some users are complaining why stage is still running even
    > without data injected，that makes their dynamic allocation fail to ramp down
    > executors. That's the point why I submit this PR.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/9597#issuecomment-156282327>.
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org