Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/10/30 01:49:37 UTC

[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19605

    [SPARK-22394] [SQL] Remove redundant synchronization for metastore access

    ## What changes were proposed in this pull request?
    
    Before Spark 2.x, metastore access was synchronized at [line229 in ClientWrapper](https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229) (now at [line203 in HiveClientImpl](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203)). In Spark 2.x, `HiveExternalCatalog` was introduced by [SPARK-13080](https://github.com/apache/spark/pull/11293), which added an extra level of synchronization at [line95](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95). That is, we now have two levels of synchronization: one in `HiveExternalCatalog` and the other on `IsolatedClientLoader` in `HiveClientImpl`. But since both `HiveExternalCatalog` and `IsolatedClientLoader` are shared among all Spark sessions, the extra level of synchronization in `HiveExternalCatalog` is redundant and can be removed.
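
The two levels described above can be pictured with a minimal sketch. Every name in it (`SharedLoader`, `LoaderLockedClient`, `DoublyLockedCatalog`, `locked`) is a hypothetical stand-in, not Spark's actual code; the only thing taken from the description is the shape: a catalog-level `synchronized` wrapped around client calls that already lock on a shared loader object.

```scala
// Hypothetical stand-in for the shared loader object that client methods lock on.
class SharedLoader

// Hypothetical client: every call takes the loader's monitor (inner level).
class LoaderLockedClient(loader: SharedLoader) {
  private var calls = 0

  // Assumed helper for this sketch: run `body` while holding the loader's monitor.
  private def locked[T](body: => T): T = loader.synchronized { body }

  def listTables(): Seq[String] = locked { calls += 1; Seq("t1", "t2") }
  def callCount: Int = locked { calls }
}

// Hypothetical catalog: every call also takes the catalog's own monitor (outer level).
class DoublyLockedCatalog(client: LoaderLockedClient) {
  private def withClient[T](body: => T): T = synchronized { body }

  // A single client call ends up nested inside two monitors.
  def listTables(): Seq[String] = withClient { client.listTables() }
}
```

In this sketch a body that makes exactly one client call is already serialized by the inner lock, which is the sense in which the outer lock adds nothing for such calls.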
    
    ## How was this patch tested?
    
    Manual test and existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark redundant_sync

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19605
    
----
commit 072b27d083f2c2ed8d8bdd20caa5b0fe0ba267f6
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-30T01:47:12Z

    remove redundant sync

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r147674938
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    sounds good


---



[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19605
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19605
  
    **[Test build #83202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83202/testReport)** for PR 19605 at commit [`072b27d`](https://github.com/apache/spark/commit/072b27d083f2c2ed8d8bdd20caa5b0fe0ba267f6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r147929746
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    I went through all methods in `HiveClient` having synchronization (except `addJar`):
    - `getState` is used only in tests.
    - `setOut`, `setInfo` and `setError` are only used in `SparkSQLEnv.init()`.
    - all other methods are called through `HiveExternalCatalog`.
    
    So it seems `addJar` is the only exception.
    
    To make `addJar` also go through `HiveExternalCatalog`, we can pass `externalCatalog` instead of `client` at [line46 in HiveSessionStateBuilder](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L46). But I don't know why we need to call `newSession()` at [line45](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45), where a new `HiveClientImpl` instance is created with the same class loader and Hive client.


---



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r147767406
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    Please check whether all the Hive client calls go through `HiveExternalCatalog`. For example, `addJar` is an exception.


---



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r147632813
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    Then can we remove the synchronization of `clientLoader` in `HiveClientImpl`? If we synchronize `withClient`, then there's no need to sync each operation in `body`.


---



[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19605
  
    **[Test build #83202 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83202/testReport)** for PR 19605 at commit [`072b27d`](https://github.com/apache/spark/commit/072b27d083f2c2ed8d8bdd20caa5b0fe0ba267f6).


---



[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19605
  
    cc @cloud-fan @rxin @gatorsmile 


---



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r148073401
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    Last year, we had a discussion about whether `addJar` should be moved to HiveExternalCatalog. See https://github.com/apache/spark/pull/14883


---



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r147619876
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    If you check the callers of `withClient`, you can find many callers conduct multiple client-related operations in the same `body`. Removing this lock might cause some concurrency issues. 
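
The concern about bodies that make several client calls can be illustrated with a minimal sketch. All names here (`PerCallLockedClient`, `CompoundCatalog`, `createIfNotExists`, `CompoundOpDemo`) are hypothetical and only mimic the shape of the code under review: each client method is synchronized on its own, and the catalog's `withClient` keeps the outer lock so that a check followed by an action runs as one unit.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.mutable

// Hypothetical client: each method is individually synchronized (inner level only).
class PerCallLockedClient {
  private val tables = mutable.Set.empty[String]

  def tableExists(name: String): Boolean = synchronized { tables.contains(name) }

  def createTable(name: String): Unit = synchronized {
    if (!tables.add(name)) throw new IllegalStateException(s"$name already exists")
  }
}

// Hypothetical catalog whose withClient keeps the outer lock.
class CompoundCatalog(client: PerCallLockedClient) {
  private def withClient[T](body: => T): T = synchronized { body }

  // Two client calls in one body: check, then act.
  def createIfNotExists(name: String): Boolean = withClient {
    if (client.tableExists(name)) false
    else { client.createTable(name); true }
  }
}

object CompoundOpDemo {
  // Runs `attempts` concurrent createIfNotExists calls and counts how many created the table.
  def createdCount(threads: Int, attempts: Int): Int = {
    val catalog = new CompoundCatalog(new PerCallLockedClient)
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = (1 to attempts).map { _ =>
        pool.submit(new Callable[Boolean] {
          def call(): Boolean = catalog.createIfNotExists("t")
        })
      }
      futures.count(_.get())
    } finally pool.shutdown()
  }
}
```

With the outer `synchronized` in place, exactly one caller creates the table. If `withClient` in this sketch did not lock, two threads could both see `tableExists` return false, and the second `createTable` would throw even though each client call is individually synchronized.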


---



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19605#discussion_r147675119
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     
       /**
    -   * Run some code involving `client` in a [[synchronized]] block and wrap certain
    -   * exceptions thrown in the process in [[AnalysisException]].
    +   * Run some code involving `client` and wrap certain exceptions thrown in the process in
    +   * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
    +   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
    +   * `clientLoader` in the `retryLocked` method.
        */
    -  private def withClient[T](body: => T): T = synchronized {
    +  private def withClient[T](body: => T): T = {
    --- End diff --
    
    but please make sure we only use the hive client here.


---



[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19605
  
    Also cc @srinathshankar @JoshRosen @hvanhovell 


---



[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19605
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83202/


---
