You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by tdas <gi...@git.apache.org> on 2017/03/27 21:22:37 UTC

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/17444

    [SPARK-19876][SS] Follow up: Refactored BatchCommitLog to simplify logic

    ## What changes were proposed in this pull request?
    
    Existing logic seemingly writes null to the BatchCommitLog, even though it does additional checks to write '{}' (valid json) to the log. This PR simplifies the logic by disallowing use of `log.add(batchId, metadata)` and instead using `log.add(batchId)`. No question of specifying metadata, so no confusion related to null. 
    
    ## How was this patch tested?
    Existing tests pass. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-19876-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17444.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17444
    
----
commit 4fa86a56f96ae120806fb75811f61128e1bd7abc
Author: Tathagata Das <ta...@gmail.com>
Date:   2017-03-27T21:22:59Z

    Refactored BatchCommitLog to simplify logic

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    **[Test build #75283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75283/testReport)** for PR 17444 at commit [`6b884c9`](https://github.com/apache/spark/commit/6b884c927b84dd01fad1c85207baf1252ef67069).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    LGTM pending tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17444#discussion_r108299910
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/BatchCommitLog.scala ---
    @@ -45,28 +45,37 @@ import org.apache.spark.sql.SparkSession
     class BatchCommitLog(sparkSession: SparkSession, path: String)
       extends HDFSMetadataLog[String](sparkSession, path) {
     
    +  import BatchCommitLog._
    +
    +  def add(batchId: Long): Unit = {
    +    super.add(batchId, SERIALIZED_VOID)
    +  }
    +
    +  override def add(batchId: Long, metadata: String): Boolean = {
    +    throw new UnsupportedOperationException(
    +      "BatchCommitLog does not take any metadata, use 'add(batchId)' instead")
    +  }
    +
       override protected def deserialize(in: InputStream): String = {
         // called inside a try-finally where the underlying stream is closed in the caller
         val lines = IOSource.fromInputStream(in, UTF_8.name()).getLines()
         if (!lines.hasNext) {
           throw new IllegalStateException("Incomplete log file in the offset commit log")
         }
    -    parseVersion(lines.next().trim, BatchCommitLog.VERSION)
    +    parseVersion(lines.next.trim, VERSION)
         // read metadata
    -    lines.next().trim match {
    -      case BatchCommitLog.SERIALIZED_VOID => null
    -      case metadata => metadata
    -    }
    +    val metadata = lines.next.trim
    +    assert(metadata == SERIALIZED_VOID, s"Batch commit log has unexpected metadata: $metadata ")
    --- End diff --
    
    What I tried to say is that if the message has some other fields in it, it shouldn't be fatal. We should probably log a warning saying that there may be bugs, upgrade your Spark version


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

Posted by tcondie <gi...@git.apache.org>.

Github user tcondie commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17444#discussion_r108311230
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/BatchCommitLog.scala ---
    @@ -45,33 +45,39 @@ import org.apache.spark.sql.SparkSession
     class BatchCommitLog(sparkSession: SparkSession, path: String)
       extends HDFSMetadataLog[String](sparkSession, path) {
     
    +  import BatchCommitLog._
    +
    +  def add(batchId: Long): Unit = {
    +    super.add(batchId, EMPTY_JSON)
    +  }
    +
    +  override def add(batchId: Long, metadata: String): Boolean = {
    +    throw new UnsupportedOperationException(
    --- End diff --
    
    What if we want metadata down the road? What's the problem with supporting this for cases other than null?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    **[Test build #75279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75279/testReport)** for PR 17444 at commit [`4fa86a5`](https://github.com/apache/spark/commit/4fa86a56f96ae120806fb75811f61128e1bd7abc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    **[Test build #75279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75279/testReport)** for PR 17444 at commit [`4fa86a5`](https://github.com/apache/spark/commit/4fa86a56f96ae120806fb75811f61128e1bd7abc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17444#discussion_r108299709
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/BatchCommitLog.scala ---
    @@ -45,28 +45,37 @@ import org.apache.spark.sql.SparkSession
     class BatchCommitLog(sparkSession: SparkSession, path: String)
       extends HDFSMetadataLog[String](sparkSession, path) {
     
    +  import BatchCommitLog._
    +
    +  def add(batchId: Long): Unit = {
    +    super.add(batchId, SERIALIZED_VOID)
    +  }
    +
    +  override def add(batchId: Long, metadata: String): Boolean = {
    +    throw new UnsupportedOperationException(
    +      "BatchCommitLog does not take any metadata, use 'add(batchId)' instead")
    +  }
    +
       override protected def deserialize(in: InputStream): String = {
         // called inside a try-finally where the underlying stream is closed in the caller
         val lines = IOSource.fromInputStream(in, UTF_8.name()).getLines()
         if (!lines.hasNext) {
           throw new IllegalStateException("Incomplete log file in the offset commit log")
         }
    -    parseVersion(lines.next().trim, BatchCommitLog.VERSION)
    +    parseVersion(lines.next.trim, VERSION)
         // read metadata
    -    lines.next().trim match {
    -      case BatchCommitLog.SERIALIZED_VOID => null
    -      case metadata => metadata
    -    }
    +    val metadata = lines.next.trim
    +    assert(metadata == SERIALIZED_VOID, s"Batch commit log has unexpected metadata: $metadata ")
    --- End diff --
    
    As long as the version is the same, we should be able to ignore this. If we don't want to ignore it, then in the message we should say please upgrade your Spark version or something


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75279/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    **[Test build #75283 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75283/testReport)** for PR 17444 at commit [`6b884c9`](https://github.com/apache/spark/commit/6b884c927b84dd01fad1c85207baf1252ef67069).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17444#discussion_r108303640
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/BatchCommitLog.scala ---
    @@ -45,28 +45,37 @@ import org.apache.spark.sql.SparkSession
     class BatchCommitLog(sparkSession: SparkSession, path: String)
       extends HDFSMetadataLog[String](sparkSession, path) {
     
    +  import BatchCommitLog._
    +
    +  def add(batchId: Long): Unit = {
    +    super.add(batchId, SERIALIZED_VOID)
    +  }
    +
    +  override def add(batchId: Long, metadata: String): Boolean = {
    +    throw new UnsupportedOperationException(
    +      "BatchCommitLog does not take any metadata, use 'add(batchId)' instead")
    +  }
    +
       override protected def deserialize(in: InputStream): String = {
         // called inside a try-finally where the underlying stream is closed in the caller
         val lines = IOSource.fromInputStream(in, UTF_8.name()).getLines()
         if (!lines.hasNext) {
           throw new IllegalStateException("Incomplete log file in the offset commit log")
         }
    -    parseVersion(lines.next().trim, BatchCommitLog.VERSION)
    +    parseVersion(lines.next.trim, VERSION)
         // read metadata
    -    lines.next().trim match {
    -      case BatchCommitLog.SERIALIZED_VOID => null
    -      case metadata => metadata
    -    }
    +    val metadata = lines.next.trim
    +    assert(metadata == SERIALIZED_VOID, s"Batch commit log has unexpected metadata: $metadata ")
    --- End diff --
    
    Right. updating that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17444: [SPARK-19876][SS] Follow up: Refactored BatchCommitLog t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17444
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75283/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17444#discussion_r108313119
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/BatchCommitLog.scala ---
    @@ -45,33 +45,39 @@ import org.apache.spark.sql.SparkSession
     class BatchCommitLog(sparkSession: SparkSession, path: String)
       extends HDFSMetadataLog[String](sparkSession, path) {
     
    +  import BatchCommitLog._
    +
    +  def add(batchId: Long): Unit = {
    +    super.add(batchId, EMPTY_JSON)
    +  }
    +
    +  override def add(batchId: Long, metadata: String): Boolean = {
    +    throw new UnsupportedOperationException(
    --- End diff --
    
    If we want metadata down the road, we will refactor this class anyways to make it not HDFSMetadataLog[String] but HDFSMetadataLog[SomeCaseClass] (like OffsetSeqLog).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17444: [SPARK-19876][SS] Follow up: Refactored BatchComm...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17444


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org