Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/10/30 01:49:37 UTC
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/19605
[SPARK-22394] [SQL] Remove redundant synchronization for metastore access
## What changes were proposed in this pull request?
Before Spark 2.x, metastore access was synchronized at [line 229 in ClientWrapper](https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229) (now at [line 203 in HiveClientImpl](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203)). In Spark 2.x, `HiveExternalCatalog` was introduced by [SPARK-13080](https://github.com/apache/spark/pull/11293), which added an extra level of synchronization at [line 95](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95). That is, we now have two levels of synchronization: one in `HiveExternalCatalog` and the other on `IsolatedClientLoader` in `HiveClientImpl`. Since both `HiveExternalCatalog` and `IsolatedClientLoader` are shared among all Spark sessions, the extra level of synchronization in `HiveExternalCatalog` is redundant and can be removed.
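A minimal sketch of the two layers described above. The class and method names mirror Spark's (`withClient`, `retryLocked`, `clientLoader`), but the bodies are simplified placeholders, not the real implementation:

```scala
// Minimal sketch of the double synchronization described above.
// Names mirror Spark's, but the bodies are simplified placeholders.
object SyncSketch {
  class IsolatedClientLoader // shared across all sessions; used as a lock

  class HiveClientImpl(clientLoader: IsolatedClientLoader) {
    // Second level: every metastore call goes through retryLocked,
    // which synchronizes on the shared clientLoader.
    private def retryLocked[A](f: => A): A = clientLoader.synchronized { f }
    def getTable(name: String): String = retryLocked { s"table:$name" }
  }

  class HiveExternalCatalog(client: HiveClientImpl) {
    // First level: the catalog synchronizes on itself. The PR argues
    // this outer lock is redundant given retryLocked above.
    private def withClient[T](body: => T): T = synchronized { body }
    def lookup(name: String): String = withClient { client.getTable(name) }
  }
}
```

Calling `lookup` acquires both locks in turn; since both objects are process-wide singletons, the outer lock adds no protection for a single client call.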
## How was this patch tested?
Manual test and existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark redundant_sync
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19605.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19605
----
commit 072b27d083f2c2ed8d8bdd20caa5b0fe0ba267f6
Author: Zhenhua Wang <wa...@huawei.com>
Date: 2017-10-30T01:47:12Z
remove redundant sync
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r147674938
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
sounds good
---
[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19605
Merged build finished. Test PASSed.
---
[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19605
**[Test build #83202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83202/testReport)** for PR 19605 at commit [`072b27d`](https://github.com/apache/spark/commit/072b27d083f2c2ed8d8bdd20caa5b0fe0ba267f6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r147929746
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
I went through all the synchronized methods in `HiveClient` (except `addJar`):
- `getState` is used only in test.
- `setOut`, `setInfo` and `setError` are only used in `SparkSQLEnv.init()`.
- all other methods are called through `HiveExternalCatalog`.
So it seems `addJar` is the only exception.
To make `addJar` also go through `HiveExternalCatalog`, we could pass `externalCatalog` instead of `client` at [line 46 in HiveSessionStateBuilder](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L46). But I don't know why we need to call `newSession()` at [line 45](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45), where a new `HiveClientImpl` instance is created with the same class loader and Hive client.
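A hypothetical sketch of the suggestion above: route `addJar` through the external catalog instead of giving the session a direct `client` reference. All names and signatures here are illustrative assumptions, not Spark's actual API:

```scala
// Hypothetical delegation sketch; all names are illustrative.
object AddJarSketch {
  class Client {
    def addJar(path: String): String = s"added:$path" // placeholder
  }

  class ExternalCatalog(client: Client) {
    private def withClient[T](body: => T): T = synchronized { body }
    // Routing addJar through the catalog means it shares the same
    // lock (and error handling) as every other metastore call.
    def addJar(path: String): String = withClient { client.addJar(path) }
  }

  // Before: the session held `client` directly.
  // After (as suggested): it holds the shared catalog instead.
  class SessionResourceLoader(catalog: ExternalCatalog) {
    def addJar(path: String): String = catalog.addJar(path)
  }
}
```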
---
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r147767406
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
Please check whether all the Hive client calls are through the Hive External Catalog. For example, `addJar` is an exception.
---
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r147632813
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
Then can we remove the synchronization on `clientLoader` in `HiveClientImpl` instead? If `withClient` is synchronized, there's no need to synchronize each operation in `body`.
---
[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19605
**[Test build #83202 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83202/testReport)** for PR 19605 at commit [`072b27d`](https://github.com/apache/spark/commit/072b27d083f2c2ed8d8bdd20caa5b0fe0ba267f6).
---
[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/19605
cc @cloud-fan @rxin @gatorsmile
---
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r148073401
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
Last year, we had a discussion about whether `addJar` should be moved to HiveExternalCatalog. See https://github.com/apache/spark/pull/14883
---
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r147619876
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
If you check the callers of `withClient`, you will find that many of them perform multiple client-related operations in the same `body`. Removing this lock might cause concurrency issues.
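The concern can be illustrated with a small, self-contained sketch. The metastore is modeled as a `Map`, and the method names are illustrative, not Spark's real API:

```scala
// Hypothetical multi-operation body illustrating the concern above.
// The metastore is modeled as a Map; names are illustrative.
object MultiOpSketch {
  private val clientLoader = new Object          // shared-lock stand-in
  private var tables = Map("t1" -> "schema1")    // toy metastore state

  // Per-call lock, as in HiveClientImpl.retryLocked.
  private def retryLocked[A](f: => A): A = clientLoader.synchronized { f }
  private def getTable(name: String): String = retryLocked { tables(name) }
  private def dropTable(name: String): Unit = retryLocked { tables -= name }
  private def createTable(name: String, schema: String): Unit =
    retryLocked { tables += (name -> schema) }

  def listTables: Set[String] = retryLocked { tables.keySet }

  // Only this outer synchronized makes the three calls below atomic;
  // retryLocked guards each call individually, so without withClient
  // another thread could interleave between them.
  private def withClient[T](body: => T): T = synchronized { body }

  def renameTable(from: String, to: String): Unit = withClient {
    val schema = getTable(from)
    dropTable(from)
    createTable(to, schema)
  }
}
```

Without the outer lock, a concurrent `getTable` could observe the state between `dropTable` and `createTable`, where the table exists under neither name.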
---
[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19605#discussion_r147675119
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
}
/**
- * Run some code involving `client` in a [[synchronized]] block and wrap certain
- * exceptions thrown in the process in [[AnalysisException]].
+ * Run some code involving `client` and wrap certain exceptions thrown in the process in
+ * [[AnalysisException]]. Thread-safety is guaranteed here because methods in the `client`
+ * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already synchronized through
+ * `clientLoader` in the `retryLocked` method.
*/
- private def withClient[T](body: => T): T = synchronized {
+ private def withClient[T](body: => T): T = {
--- End diff --
But please make sure we only use the Hive client here.
---
[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19605
Also cc @srinathshankar @JoshRosen @hvanhovell
---
[GitHub] spark issue #19605: [SPARK-22394] [SQL] Remove redundant synchronization for...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19605
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83202/
---