You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by Gauravshah <gi...@git.apache.org> on 2017/02/07 20:53:34 UTC

[GitHub] spark pull request #16842: SPARK-19304 fix kinesis slow checkpoint recovery

GitHub user Gauravshah opened a pull request:

    https://github.com/apache/spark/pull/16842

    SPARK-19304 fix kinesis slow checkpoint recovery

    ## What changes were proposed in this pull request?
    added a limit to getRecords api call call in KinesisBackedBlockRdd. This helps reduce the amount of data returned by kinesis api call making the recovery considerable faster
    
    As we are storing the `fromSeqNum` & `toSeqNum` in checkpoint metadata, we can also store the number of records. Which can alter be used for api call.
    
    ## How was this patch tested?
    The patch was manually tested
    
    Apologies for any silly mistakes, opening first pull request


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Gauravshah/spark kinesis_checkpoint_recovery_fix_2_1_0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16842.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16842
    
----
commit b5e544a8ec326149b7d03773dd7abf8703ee44a2
Author: Gaurav <ga...@techtinium.com>
Date:   2017-02-07T19:21:28Z

    added limit to kinesis checkpoint backed rdd to reduce number of record loaded on aws getRecords call

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    @brkyvz can I do something to take it forward ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    will work on testcases today


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16842


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    okay to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r102366680
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    Not sure on upgrading, since for code upgrade we need to delete the checkpoint directory and start afresh. I did run this patch and was able to serialize the limit into checkpoint, ( not a scala pro though)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    @Gauravshah Can you please comment on how much faster this PR improved your recovery time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    @srowen Do you know if we make the field of a case class an `Option` and default it as `None`, would it still fail Java deserialization


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r103274162
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -212,7 +214,7 @@ class KinesisSequenceRangeIterator(
         val getRecordsRequest = new GetRecordsRequest
         getRecordsRequest.setRequestCredentials(credentials)
         getRecordsRequest.setShardIterator(shardIterator)
    -    getRecordsRequest.setLimit(recordCount)
    +    getRecordsRequest.setLimit(Math.max(recordCount, this.maxGetRecordsLimit))
    --- End diff --
    
    this should be a `min` not a `max`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r99928115
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    http://docs.aws.amazon.com/streams/latest/dev/key-concepts.html#sequence-number


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r103547648
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    --- End diff --
    
    one parameter per line:
    
    ```scala
      streamName: String,
      shardId: String,
      ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    **[Test build #3595 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3595/testReport)** for PR 16842 at commit [`c8efdcf`](https://github.com/apache/spark/commit/c8efdcf8d253d2f733cfa3a5a0e89288f4e8f1cf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    @brkyvz Thank you for taking this forward.  We have batch interval of 2 minutes & takes ~1.1 minutes to process. With older code it takes 10-12 minutes to recover and with limit fix it recovers in 2.5-3 minutes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    @srowen Do you know why this hasn't kicked off any tests?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r99927422
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    Not sure of a better place to put.
    from - to != count. Kinesis seqNumber are in order but are not sequential


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    @srowen I assumed that you cannot update code if you want to recover from checkpoint.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    LGTM! Merging to master! Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r104282408
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -193,9 +201,10 @@ class KinesisSequenceRangeIterator(
       /**
        * Get records starting from or after the given sequence number.
        */
    -  private def getRecords(iteratorType: ShardIteratorType, seqNum: String): Iterator[Record] = {
    +  private def getRecords(iteratorType: ShardIteratorType, seqNum: String,
    --- End diff --
    
    \U0001f44d 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    thanks @srowen & @brkyvz 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r99929314
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    its an input to spark checkpoint metadata. On streaming KinesisReceiver receives records creates blocks & knows about seqNumber, count. When recovering from checkpoint we read back this information from checkpoint and make aws kinesis getRecords call with fromSeqNumber & limit


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    Yes it would certainly change the format of the default Java serialization. It wouldn't be compatible. The fields would have different types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    **[Test build #3561 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3561/testReport)** for PR 16842 at commit [`b5e544a`](https://github.com/apache/spark/commit/b5e544a8ec326149b7d03773dd7abf8703ee44a2).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    OK I think I see. MIght still be good for @brkyvz to review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r102344448
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    I'm worried this change will break checkpoint recovery, because we use Java serialization, and be a barrier to users from upgrading.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r102366702
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -204,10 +208,11 @@ class KinesisSequenceRangeIterator(
        * to get records from Kinesis), and get the next shard iterator for next consumption.
        */
       private def getRecordsAndNextKinesisIterator(
    -      shardIterator: String): (Iterator[Record], String) = {
    +      shardIterator: String, recordCount: Int): (Iterator[Record], String) = {
         val getRecordsRequest = new GetRecordsRequest
         getRecordsRequest.setRequestCredentials(credentials)
         getRecordsRequest.setShardIterator(shardIterator)
    +    getRecordsRequest.setLimit(recordCount)
    --- End diff --
    
    \U0001f44d 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    **[Test build #3561 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3561/testReport)** for PR 16842 at commit [`b5e544a`](https://github.com/apache/spark/commit/b5e544a8ec326149b7d03773dd7abf8703ee44a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r99927051
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    Why is this a property of a range -- or when would it not equal (from - to + 1)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r102343964
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -204,10 +208,11 @@ class KinesisSequenceRangeIterator(
        * to get records from Kinesis), and get the next shard iterator for next consumption.
        */
       private def getRecordsAndNextKinesisIterator(
    -      shardIterator: String): (Iterator[Record], String) = {
    +      shardIterator: String, recordCount: Int): (Iterator[Record], String) = {
         val getRecordsRequest = new GetRecordsRequest
         getRecordsRequest.setRequestCredentials(credentials)
         getRecordsRequest.setShardIterator(shardIterator)
    +    getRecordsRequest.setLimit(recordCount)
    --- End diff --
    
    if this value is greater than 10000, this will throw an error


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r103547760
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -204,10 +210,11 @@ class KinesisSequenceRangeIterator(
        * to get records from Kinesis), and get the next shard iterator for next consumption.
        */
       private def getRecordsAndNextKinesisIterator(
    -      shardIterator: String): (Iterator[Record], String) = {
    +      shardIterator: String, recordCount: Int): (Iterator[Record], String) = {
    --- End diff --
    
    ditto, one param per line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    **[Test build #3595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3595/testReport)** for PR 16842 at commit [`c8efdcf`](https://github.com/apache/spark/commit/c8efdcf8d253d2f733cfa3a5a0e89288f4e8f1cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r104210579
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -193,9 +201,10 @@ class KinesisSequenceRangeIterator(
       /**
        * Get records starting from or after the given sequence number.
        */
    -  private def getRecords(iteratorType: ShardIteratorType, seqNum: String): Iterator[Record] = {
    +  private def getRecords(iteratorType: ShardIteratorType, seqNum: String,
    --- End diff --
    
    you forgot here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r99928041
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator
     /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */
     private[kinesis]
     case class SequenceNumberRange(
    -    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
    +    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    +    recordCount: Int)
    --- End diff --
    
    OK, but is it an 'input' or 'output'? the usage below makes it look like the caller dictates how many records are in the range, but it doesn't know that ahead of time? I probably misunderstand this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16842: [WIP] [SPARK-19304] [Streaming] [Kinesis] fix kinesis sl...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/16842
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16842#discussion_r103502545
  
    --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala ---
    @@ -212,7 +214,7 @@ class KinesisSequenceRangeIterator(
         val getRecordsRequest = new GetRecordsRequest
         getRecordsRequest.setRequestCredentials(credentials)
         getRecordsRequest.setShardIterator(shardIterator)
    -    getRecordsRequest.setLimit(recordCount)
    +    getRecordsRequest.setLimit(Math.max(recordCount, this.maxGetRecordsLimit))
    --- End diff --
    
    \U0001f44d 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org