Posted to reviews@spark.apache.org by zuotingbing <gi...@git.apache.org> on 2017/12/20 08:08:33 UTC

[GitHub] spark pull request #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Ser...

GitHub user zuotingbing opened a pull request:

    https://github.com/apache/spark/pull/20029

    [SPARK-22793][SQL]Memory leak in Spark Thrift Server

    # What changes were proposed in this pull request?
    1. Start HiveThriftServer2.
    2. Connect to thriftserver through beeline.
    3. Close the beeline.
    4. Repeat steps 2 and 3 several times.
    We found many directories under `hive.exec.local.scratchdir` and `hive.exec.scratchdir` that are never deleted. As we know, each scratch dir is added to the FileSystem `deleteOnExit` cache when it is created, so that cache keeps growing until the JVM terminates.
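
    To make the mechanism concrete, here is a minimal sketch using the plain Hadoop API (not the thrift-server code; the paths and counts below are made up) of how the `deleteOnExit` cache only grows unless something cancels its entries:

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object DeleteOnExitSketch {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.getLocal(new Configuration())
        // Each "session" registers its scratch dir for deletion at JVM exit.
        (1 to 1000).foreach { i =>
          val scratch = new Path(s"/tmp/hive-scratch/session-$i")
          fs.mkdirs(scratch)
          fs.deleteOnExit(scratch) // the entry stays cached until the JVM terminates
        }
        // fs.cancelDeleteOnExit(path) would release an entry early, but nothing
        // invokes it when a beeline connection is closed.
      }
    }
    ```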
    
    In addition, using `jmap -histo:live [PID]` to print the object histogram of the HiveThriftServer2 process, we can see that the instance counts of `org.apache.spark.sql.hive.client.HiveClientImpl` and `org.apache.hadoop.hive.ql.session.SessionState` keep increasing even after all beeline connections are closed, which may cause a memory leak.
    
    # How was this patch tested?
    Manual tests.
    
    This PR is a follow-up to https://github.com/apache/spark/pull/19989

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zuotingbing/spark SPARK-22793

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20029.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20029
    
----
commit 2b1e166f4f43b300d272fc7ce1d9d7997f7ae3cd
Author: zuotingbing <zu...@...>
Date:   2017-12-20T07:52:21Z

    [SPARK-22793][SQL]Memory leak in Spark Thrift Server

----


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by liufengdb <gi...@git.apache.org>.
Github user liufengdb commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    By [this line](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78), yes.
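
    As a rough, hedged illustration of what that means in public API terms (this is not the thrift-server code itself; names are illustrative), each connection gets its own isolated session on top of the shared context:

    ```scala
    import org.apache.spark.sql.SparkSession

    // One root session for the server process (assumes Hive classes on the classpath).
    val root = SparkSession.builder()
      .master("local[*]")
      .appName("thriftserver-session-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // An isolated child session (shared SparkContext, fresh session state),
    // roughly one per incoming connection unless single-session mode is used.
    val perConnection = root.newSession()
    ```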


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85736/
    Test PASSed.


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    I'm asking you to respond to https://github.com/apache/spark/pull/19989#issuecomment-351985114


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    > The hiveClient created for the resourceLoader is only used to addJar, which is, in turn, to add Jar to the shared IsolatedClientLoader. Then we can just use the shared hive client for this purpose.
    
    @liufengdb does it mean that we are creating more than one SparkSession in the thriftserver?


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    This is indeed the primary change in the diff against master. https://github.com/apache/spark/pull/19989 raised some concerns about whether this affects correctness, though?


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    > we can find the object org.apache.spark.sql.hive.client.HiveClientImpl and org.apache.hadoop.hive.ql.session.SessionState keep increasing
    
    Can you check the GC roots and explain why they are increasing? The fix does not look correct to me, as we should create a new session.


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by zuotingbing <gi...@git.apache.org>.
Github user zuotingbing commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    Thanks @srowen, so whom should I ping to make sure this change has no side effects?


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    **[Test build #85736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85736/testReport)** for PR 20029 at commit [`2b1e166`](https://github.com/apache/spark/commit/2b1e166f4f43b300d272fc7ce1d9d7997f7ae3cd).


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    > The hiveClient created for the resourceLoader is only used to addJar, which is, in turn, to add Jar to the shared IsolatedClientLoader. Then we can just use the shared hive client for this purpose.
    
    Shouldn't `addJar` be session-based? At least in Hive it seems to be: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources
    
    Although it looks like `addJar` in `SessionResourceLoader` for `SessionState` isn't session-based either, so at least we have consistent behavior.
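
    A small, hedged illustration of the question (public API only; the jar path is made up and assumed to exist), in line with the note further down this thread that `addJar` is cross-session:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("addjar-sketch").getOrCreate()
    val other = spark.newSession()

    spark.sql("ADD JAR /tmp/my-udfs.jar") // hypothetical jar
    // Per the discussion here, the jar goes to the shared loader, so it is
    // effectively visible from `other` as well rather than being session-scoped.
    ```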
    
    
    
    



---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by zuotingbing <gi...@git.apache.org>.
Github user zuotingbing commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    Could you please check this PR? Thanks @liufengdb


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    `addJar` is cross-session.


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by liufengdb <gi...@git.apache.org>.
Github user liufengdb commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    lgtm!


---



[GitHub] spark pull request #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Ser...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20029


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    ok to test


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    cc @liufengdb 


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    From your analysis, what seems to never be closed is the client used to interact with the metastore. This might be a problem we are not aware of in normal SQL applications, since there we have only one client.
    
    What your fix does is avoid creating a client for each `HiveSessionBuilder`, thus:
    
     1. this would mean that we are creating more than one `SessionBuilder`, i.e. more than one `SparkSession`, which is not true as far as I know;
     2. every session would share the same client to connect to the metastore, which is wrong IMHO.
    
    Please let me know if I misunderstood something or got anything wrong.


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by zuotingbing <gi...@git.apache.org>.
Github user zuotingbing commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    It seems that each time we connect to the thrift server through beeline, `SessionState.start(state)` is called twice: once in `HiveSessionImpl.open`, and once in `HiveClientImpl.newSession()` for `sql("use default")`. When the beeline connection is closed, only the HiveSession is closed via `HiveSessionImpl.close()`, while the object created by `HiveClientImpl.newSession()` is left over.
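
    A minimal sketch of the imbalance described above, using Hive's `SessionState` API directly (illustrative only, not the actual Spark call sites):

    ```scala
    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.ql.session.SessionState

    val conf = new HiveConf()
    val state = new SessionState(conf)
    SessionState.start(state) // creates per-session state (including scratch dirs)
    // ... run statements ...
    state.close()             // without a matching close(), each extra start()
                              // leaves its state and scratch dirs behind
    ```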


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    **[Test build #85736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85736/testReport)** for PR 20029 at commit [`2b1e166`](https://github.com/apache/spark/commit/2b1e166f4f43b300d272fc7ce1d9d7997f7ae3cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by liufengdb <gi...@git.apache.org>.
Github user liufengdb commented on the issue:

    https://github.com/apache/spark/pull/20029
  
    @zuotingbing I took a close look at the related code and think the issue you raised is valid:
    
    1. The hiveClient created for the [resourceLoader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45) is only used for addJar, which, in turn, adds the jar to the shared [`IsolatedClientLoader`](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L817). So we can just use the shared hive client for this purpose.
    
    2. Another possible reason to use a new hive client is to run [this hive statement](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818). But I think it is just a leftover from old Spark and should be removed. So overall it is fine to use the shared `client` from `HiveExternalCatalog` without creating a new hive client.
    
    3. Currently, there is no way to clean up the resources created by a [new session of SQLContext/SparkSession](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78). I don't understand the design tradeoff behind [this](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L716) (@srowen). So it is not easy to remove the temp dirs when a session is closed.
    
    4. To what extent does Spark need these scratch dirs? Could we make this step optional if it is not used in all deployment modes?


---



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

Posted by zuotingbing <gi...@git.apache.org>.
Github user zuotingbing commented on the issue:

    https://github.com/apache/spark/pull/20029
  
        override protected lazy val resourceLoader: HiveSessionResourceLoader = {
          val client: HiveClient = externalCatalog.client.newSession()
          new HiveSessionResourceLoader(session, client)
        }

    Is it necessary to create a new HiveClient here, since there is already a hiveClient in `externalCatalog`?
    If it is necessary, we need to provide a method to close the hiveClient created here, and in that method we also need to clean up the scratch dirs (`hdfsSessionPath` and `localSessionPath`) created by `HiveClientImpl`.
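
    For comparison, a hedged sketch of the direction discussed above (reusing the client already held by `HiveExternalCatalog` instead of spawning one per session); this illustrates the idea, not necessarily the exact final diff:

    ```scala
    override protected lazy val resourceLoader: HiveSessionResourceLoader = {
      // Reuse the catalog's shared client: addJar only needs the shared
      // IsolatedClientLoader, so no per-session SessionState or scratch dirs
      // are created here.
      val client: HiveClient = externalCatalog.client
      new HiveSessionResourceLoader(session, client)
    }
    ```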


---
