Posted to reviews@spark.apache.org by rdblue <gi...@git.apache.org> on 2018/08/02 19:32:55 UTC

[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/21977

    SPARK-25004: Add spark.executor.pyspark.memory limit.

    ## What changes were proposed in this pull request?
    
    This adds `spark.executor.pyspark.memory` to configure Python's address space limit, [`resource.RLIMIT_AS`](https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS). Limiting Python's address space allows Python to participate in memory management. In practice, we see fewer cases of Python taking too much memory because it doesn't know to run garbage collection. This results in YARN killing fewer containers. This also improves error messages so users know that Python is consuming too much memory:
    
    ```
      File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in fe_engineer
        fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
      File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
        comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, []), mat_rec_prep.get(item, []))
      File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in leven_list_compare
        permutations = sorted(permutations, reverse=True)
      MemoryError
    ```
    
    The new PySpark memory setting is used to increase the requested YARN container memory, instead of sharing overhead memory between Python and off-heap JVM activity.
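
    As a minimal sketch of the mechanism described above (illustrative only; the value and names here are assumptions, not the patch's exact code), the worker caps its own address space roughly like this:

    ```python
    import resource

    # Illustrative value; the real limit comes from spark.executor.pyspark.memory.
    limit_bytes = 512 * 1024 * 1024

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if soft == resource.RLIM_INFINITY:
        # Cap the address space so oversized allocations raise MemoryError in
        # Python instead of YARN killing the whole container.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
    ```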
    
    ## How was this patch tested?
    
    Tested memory limits in our YARN cluster and verified that MemoryError is thrown.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-25004-add-python-memory-limit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21977.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21977
    
----
commit 19cd9c5cce4420729074a0976b129889d70fd56c
Author: Ryan Blue <bl...@...>
Date:   2018-05-09T18:34:50Z

    SPARK-25004: Add spark.executor.pyspark.memory limit.

----


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94251/
    Test FAILed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207716893
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
       // Additional memory overhead.
       protected val memoryOverhead: Int = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)).toInt
    +  protected val pysparkWorkerMemory: Int = if (sparkConf.get(IS_PYTHON_APP)) {
    +    sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
    --- End diff --
    
    or just use 0 in worker.py too


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1812/
    Test PASSed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208325737
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write()
    +
    +            if memory_limit_mb > 0 and total_memory_limit < 0:
    --- End diff --
    
    That makes sense. What about only setting the limit if it is lower than the current limit? (E.g., I could see a container system setting a limit based on an assumption that doesn't hold once Spark is in the mix; if we come up with a lower limit, we could apply it.)


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95225/testReport)** for PR 21977 at commit [`bb8fecb`](https://github.com/apache/spark/commit/bb8fecb19e2a231061f95a327a162e48df48e9cd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    > So if users don't set this conf, the behavior is the same as before, right?
    
    Yes.
    
    > My concern is that meaning of the overhead parameter becomes pretty confusing.
    
    I think it's easier to reason about, but that's because I think most users don't really know what the overhead is for besides Python anyway.
    
    > More general brainstorming -- I suppose there is no way to give python hint to gc more often? This is sort of like moving from the UnifiedMemoryManager back to the Static one, as now you put in a hard barrier.
    
    There are other Python memory limits, like heap size. We found that the heap size limit isn't enforced and YARN would still kill the process for exceeding its allocation. We also didn't see MemoryError, which is what signals to the user that Python is responsible. So I think this is the best we can do for now. We might be able to play around with other limits to get it to GC more often, but we need the hard limit to keep the errors in Python and not in YARN.
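
    To illustrate the behaviour described above, here is a standalone sketch with assumed values (not the patch itself): once `RLIMIT_AS` is capped, an oversized allocation fails inside Python with MemoryError rather than the container being killed externally.

    ```python
    import resource

    cap = 256 * 1024 * 1024  # assumed 256 MB address-space cap
    resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    try:
        data = bytearray(512 * 1024 * 1024)  # larger than the cap
    except MemoryError:
        print("allocation failed inside Python, as intended")
    ```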


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207409771
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write()
    --- End diff --
    
    Forgot to output `msg` here?


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    You're right on the line number. Maybe it was that I hadn't done a full rebuild in this branch locally before running the test. I'll look into the other error if that's consistent in the Jenkins tests.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    test cases?


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95064/testReport)** for PR 21977 at commit [`505f2eb`](https://github.com/apache/spark/commit/505f2eb09d60c695a80c7f62bde9a19a0e677357).


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208310790
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,26 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  private val memoryMb = {
    +    val allocation = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +    if (reuseWorker) {
    --- End diff --
    
    That `useDaemon` flag controls whether worker.py is called directly from Spark or from a daemon process that forks new workers. The daemon process is used because [forking from a huge JVM is more expensive](https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L102-L104).
    
    This isn't related to REUSE_WORKER, which is [used by the forked worker process](https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/python/pyspark/daemon.py#L112). If reuse is set, then the process [keeps the worker alive](https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/python/pyspark/daemon.py#L173) and will call its main method again.
    
    But, it looks like my understanding of `spark.python.worker.reuse` is still incorrect given that configuration's description: "Reuse Python worker or not. If yes, it will use a fixed number of Python workers, does not need to fork() a Python process for every task." Sounds like reuse means that workers stay around for multiple tasks, not that concurrent tasks run in the same worker.
    
    I'll do some more digging to see what determines the number of workers. My guess is the number of cores, in which case this should always divide by the number of cores.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209209021
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    tiny nit: indentation


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94750/
    Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2146/
    Test PASSed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94191 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94191/testReport)** for PR 21977 at commit [`2fb3d79`](https://github.com/apache/spark/commit/2fb3d790dac74a8db6fffac7f34c89880b53e11f).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by ifilonenko <gi...@git.apache.org>.
Github user ifilonenko commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    This seems very applicable to Kubernetes as well. We already increased the DEFAULT_MEMORY_OVERHEAD to account for memory issues that arise when users forget to increase the memory overhead. Could this be expanded to cover that cluster manager as well? I'd be happy to help by appending to this PR (or in a follow-up) to include that.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208120759
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,26 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  private val memoryMb = {
    +    val allocation = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +    if (reuseWorker) {
    --- End diff --
    
    maybe here https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L89?


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207595892
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write()
    +
    +            if memory_limit_mb > 0 and total_memory_limit < 0:
    --- End diff --
    
    So the logic of this block appears to be: the user has requested a memory limit and Python does not have a memory limit set. If the user has requested a different memory limit than the one already set, though, regardless of whether there is a current memory limit, would it make sense to set it?

    It's also possible I've misunderstood the rlimit return values here.

    That being said, even if that is the behaviour we want, should we use `resource.RLIM_INFINITY` to check if it's unlimited?
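
    For reference, a small standalone check (not part of the patch) of what `getrlimit` reports when no limit is configured:

    ```python
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # On Linux an unlimited address space is reported as resource.RLIM_INFINITY
    # (which is -1), so comparing against RLIM_INFINITY is the explicit way to
    # test for "no limit set" rather than checking `< 0`.
    print(soft == resource.RLIM_INFINITY, hard == resource.RLIM_INFINITY)
    ```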


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95064/
    Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94804 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94804/testReport)** for PR 21977 at commit [`6810fc7`](https://github.com/apache/spark/commit/6810fc7f30131750a1a472467b7984a2c9150a9b).


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95175/
    Test FAILed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207409267
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -333,7 +340,7 @@ private[spark] class Client(
         val maxMem = newAppResponse.getMaximumResourceCapability().getMemory()
         logInfo("Verifying our application has not requested more than the maximum " +
           s"memory capability of the cluster ($maxMem MB per container)")
    -    val executorMem = executorMemory + executorMemoryOverhead
    +    val executorMem = executorMemory + executorMemoryOverhead + pysparkWorkerMemory
         if (executorMem > maxMem) {
           throw new IllegalArgumentException(s"Required executor memory ($executorMemory" +
             s"+$executorMemoryOverhead MB) is above the max threshold ($maxMem MB) of this cluster! " +
    --- End diff --
    
    Should add `pysparkWorkerMemory` here too.
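
    For a concrete sense of the request being validated here (the numbers are assumptions, not values from the PR): with a 4096 MB executor, the usual 10% overhead with a 384 MB floor, and a 1024 MB pyspark allocation, the container request becomes:

    ```python
    executor_memory = 4096                                   # MB, assumed
    memory_overhead = max(int(0.10 * executor_memory), 384)  # 409 MB
    pyspark_worker_memory = 1024                              # MB, assumed

    executor_mem = executor_memory + memory_overhead + pyspark_worker_memory
    print(executor_mem)  # 5529 MB requested from YARN per container
    ```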


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2340/
    Test PASSed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207722402
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
       // Additional memory overhead.
       protected val memoryOverhead: Int = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)).toInt
    +  protected val pysparkWorkerMemory: Int = if (sparkConf.get(IS_PYTHON_APP)) {
    +    sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
    --- End diff --
    
    -1 in worker.py signals that it isn't set. Here, we use an Option instead. 0 is the correct size of the allocation to add to YARN resource requests.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1714/
    Test PASSed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207716813
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write(msg)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    +                # convert to bytes
    +                total_memory_limit = memory_limit_mb * 1024 * 1024
    +
    +                msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +                sys.stderr.write(msg)
    +                resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
    +
    +        except (resource.error, OSError) as e:
    +            # not all systems support resource limits, so warn instead of failing
    +            sys.stderr.write("WARN: Failed to set memory limit: {0}\n".format(e))
    --- End diff --
    
    Catch ValueError as well, for the case where the hard limit can't be set (if one is otherwise already set).
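
    For context, a small illustration of when `setrlimit` raises ValueError (a standalone sketch, not the patch): an unprivileged process cannot raise its hard limit.

    ```python
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        try:
            # Attempting to raise an already-set hard limit without privileges fails.
            resource.setrlimit(resource.RLIMIT_AS, (soft, hard + 1))
        except ValueError as e:
            print("could not raise hard limit:", e)
    ```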


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209329689
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            print(msg, file=sys.stderr)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    --- End diff --
    
    From our discussion in https://github.com/apache/spark/pull/21977#discussion_r208339172 I thought we were going to do this if there was no limit or the limit requested was lower?
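
    A sketch of that variant (assumed names and values, not the code as merged): apply the configured limit when no limit is set, or when the configured limit is lower than the existing one.

    ```python
    import resource

    memory_limit_mb = 1024  # assumed value of PYSPARK_EXECUTOR_MEMORY_MB
    new_limit = memory_limit_mb * 1024 * 1024

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if memory_limit_mb > 0 and (soft == resource.RLIM_INFINITY or new_limit < soft):
        resource.setrlimit(resource.RLIMIT_AS, (new_limit, new_limit))
    ```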


---



[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212760057
  
    --- Diff: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
    @@ -161,6 +162,11 @@ abstract class BaseYarnClusterSuite
         }
         extraJars.foreach(launcher.addJar)
     
    +    if (outFile.isDefined) {
    --- End diff --
    
    If you do a foreach then the `.get` goes away and the code could be a little cleaner, but it's pretty minor.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94186/
    Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    I'd be happy to. I've got a live review tomorrow, so I'll take a look at this then.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94065 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94065/testReport)** for PR 21977 at commit [`121d3b5`](https://github.com/apache/spark/commit/121d3b51f6cf06906e188dc7fd07976bbca1bb6a).


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94066/testReport)** for PR 21977 at commit [`c072dcb`](https://github.com/apache/spark/commit/c072dcb271ec0366be63e47dee82c9abcaaf0f17).


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2494/
    Test PASSed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213192210
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -62,14 +63,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
         argOffsets: Array[Array[Int]])
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val conf = SparkEnv.get.conf
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    @rdblue, I fixed the site to refer to Databricks's guide. Mind fixing this one if there are more changes to be pushed?


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207690569
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -81,6 +82,17 @@ case class AggregateInPandasExec(
     
         val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
         val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val memoryMb = {
    --- End diff --
    
    I went ahead with the refactor.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209707560
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -137,13 +135,12 @@ case class AggregateInPandasExec(
     
           val columnarBatchIter = new ArrowPythonRunner(
             pyFuncs,
    -        bufferSize,
    -        reuseWorker,
             PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
             argOffsets,
             aggInputSchema,
             sessionLocalTimeZone,
    -        pythonRunnerConf).compute(projectedRowIter, context.partitionId(), context)
    +        pythonRunnerConf,
    +        sparkContext.conf).compute(projectedRowIter, context.partitionId(), context)
    --- End diff --
    
    I'm updating it to `SparkEnv.get.conf`, which is what I used in other places. Thanks for catching this.


---



[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212757824
  
    --- Diff: docs/configuration.md ---
    @@ -179,6 +179,15 @@ of the most common options to set are:
         (e.g. <code>2g</code>, <code>8g</code>).
       </td>
     </tr>
    +<tr>
    + <td><code>spark.executor.pyspark.memory</code></td>
    +  <td>Not set</td>
    +  <td>
    +    The amount of memory to be allocated to PySpark in each executor, in MiB
    --- End diff --
    
    We should probably mention that this is added to the executor memory request in YARN mode.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94189/
    Test FAILed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209374683
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            print(msg, file=sys.stderr)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    +                # convert to bytes
    +                total_memory_limit = memory_limit_mb * 1024 * 1024
    +
    +                msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +                print(msg, file=sys.stderr)
    +                resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
    --- End diff --
    
    Is the hard limit here intended to be `total_memory_limit`, or should it be `max_total_memory`?
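
    For clarity on the soft/hard distinction behind this question (a standalone illustration, not the patch): `setrlimit` takes a `(soft, hard)` pair. The soft limit is what allocations hit first; the hard limit is the ceiling up to which the process may later raise its soft limit. Passing the same value for both, as the patch does, pins the worker to that cap.

    ```python
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Re-applying the current values is a no-op; the patch passes the configured
    # limit in both positions so the worker cannot raise the limit again later.
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    ```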


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208051279
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,26 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  private val memoryMb = {
    +    val allocation = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +    if (reuseWorker) {
    --- End diff --
    
    So I'm still not sure about this part. Did you have a chance to look at the code path for where we decide to fork or not?


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212499194
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    > I don't think that the DataBricks style guide applies to Apache projects.
    
    I sent an email to dev mailing list - http://apache-spark-developers-list.1001551.n3.nabble.com/Porting-or-explicitly-linking-project-style-in-Apache-Spark-based-on-https-github-com-databricks-scae-td24790.html
    
    I was thinking 2 indents for continuation lines are more common in the codebase, and thought it would be better to follow that.


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207636504
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -51,6 +52,17 @@ private[spark] class PythonRDD(
       val bufferSize = conf.getInt("spark.buffer.size", 65536)
       val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
     
    +  val memoryMb = {
    --- End diff --
    
    I thought the comments below were clear: if a single worker is reused, it gets the entire allocation. If each core starts its own worker, each one gets an equal share.
    
    If `reuseWorker` is actually ignored, then this needs to be updated.
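
    A sketch of the split described above (assumed values, reflecting the reasoning at this point in the review rather than the final code): the whole allocation when a single worker is reused, otherwise an equal share per core.

    ```python
    pyspark_memory_mb = 2048  # assumed spark.executor.pyspark.memory (MB)
    executor_cores = 4        # assumed spark.executor.cores
    reuse_worker = False      # assumed spark.python.worker.reuse

    per_worker_mb = pyspark_memory_mb if reuse_worker else pyspark_memory_mb // executor_cores
    print(per_worker_mb)  # 512
    ```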


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @holdenk pyarrow uses a C++-based memory pool, so I'm not sure exactly how that interacts with rlimit, but I ran some tests and it looks like an error is thrown when the limit is set.
    
    **with setrlimit**
    ```python
    >>> import pyarrow as pa
    >>> import resource
    >>> resource.setrlimit(resource.RLIMIT_AS, (1000 * 1024 * 1024, 1000 * 1024 * 1024))
    >>> a = list(range(1 << 20))
    >>> b = [pa.array(a) for i in range(10)]
    >>> c = [pa.array(a) for i in range(10)]
    >>> pa.total_allocated_bytes()
    170393600
    >>> d = [pa.array(a) for i in range(100)]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 1, in <listcomp>
      File "pyarrow/array.pxi", line 186, in pyarrow.lib.array
      File "pyarrow/array.pxi", line 26, in pyarrow.lib._sequence_to_array
      File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
    pyarrow.lib.ArrowMemoryError: malloc of size 8388608 failed
    ```
    
    **no limit**
    ```python
    >>> import pyarrow as pa
    >>> a = list(range(1 << 20))
    >>> b = [pa.array(a) for i in range(10)]
    >>> c = [pa.array(a) for i in range(10)]
    >>> pa.total_allocated_bytes()
    170393600
    >>> d = [pa.array(a) for i in range(100)]
    >>> pa.total_allocated_bytes()
    1022361600
    ```
    
    One thing I wasn't expecting: importing pyarrow and its shared libraries after setting rlimit can fail if the limit is set too low, and it is not a clean failure - is this expected?
    
    ```python
    >>> import resource
    >>> resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, 100 * 1024 * 1024))
    >>> import pyarrow
    Traceback (most recent call last):
      File "/home/bryan/miniconda2/envs/pa010py35/lib/python3.5/site-packages/numpy/core/__init__.py", line 16, in <module>
        from . import multiarray
    ImportError: libopenblas.so.0: failed to map segment from shared object
    ```


---



[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94057/testReport)** for PR 21977 at commit [`8846bdc`](https://github.com/apache/spark/commit/8846bdc261670d7e9808229be6219445a6a427a6).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208449418
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -137,13 +135,12 @@ case class AggregateInPandasExec(
     
           val columnarBatchIter = new ArrowPythonRunner(
             pyFuncs,
    -        bufferSize,
    -        reuseWorker,
             PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
             argOffsets,
             aggInputSchema,
             sessionLocalTimeZone,
    -        pythonRunnerConf).compute(projectedRowIter, context.partitionId(), context)
    +        pythonRunnerConf,
    +        sparkContext.conf).compute(projectedRowIter, context.partitionId(), context)
    --- End diff --
    
    Seems like this is in executor side, but can we get `sparkContext`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94230 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94230/testReport)** for PR 21977 at commit [`3616f60`](https://github.com/apache/spark/commit/3616f607d3a20a6f4a27e63ee42642ef1ccc0366).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @squito, the last Jenkins test had a different error message and two of the tests didn't show the stdout/stderr output. I updated the other tests to show that output, and hopefully we get a run with a consistent error message.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95313/testReport)** for PR 21977 at commit [`0b275cf`](https://github.com/apache/spark/commit/0b275cfea7d83cdf61802da30c4a7604be8900e4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2213/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1922/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212782657
  
    --- Diff: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
    @@ -161,6 +162,11 @@ abstract class BaseYarnClusterSuite
         }
         extraJars.foreach(launcher.addJar)
     
    +    if (outFile.isDefined) {
    --- End diff --
    
    Like I said, I think `foreach` is bad practice with options, so I'd rather not change this to use it. I'd be happy to change this to a pattern match if you think it is really desirable to get rid of the `.get`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r211763717
  
    --- Diff: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
    @@ -161,6 +162,11 @@ abstract class BaseYarnClusterSuite
         }
         extraJars.foreach(launcher.addJar)
     
    +    if (outFile.isDefined) {
    --- End diff --
    
    I think that `foreach` on an option is unclear (looks like a loop) and should be avoided unless it really simplifies the logic, which it doesn't do here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94710/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94229/testReport)** for PR 21977 at commit [`0c0ff92`](https://github.com/apache/spark/commit/0c0ff92693945f3c5ae63f60af94b88281be0c32).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212714476
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -114,6 +114,10 @@ package object config {
         .checkValue(_ >= 0, "The off-heap memory size must not be negative")
         .createWithDefault(0)
     
    +  private[spark] val PYSPARK_EXECUTOR_MEMORY = ConfigBuilder("spark.executor.pyspark.memory")
    --- End diff --
    
    Yes, it should. I'll fix it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94386/testReport)** for PR 21977 at commit [`ee750ef`](https://github.com/apache/spark/commit/ee750efae806ea958a6a5a327799dafe6a0b3e64).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Is anyone able to reproduce the YARN test failures? When I run the tests, a different set fails that seems unrelated to Python. This PR shouldn't change the behavior of any tests, because the environment variable isn't set, so Python's behavior shouldn't change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test failure seems completely unrelated. Merging to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @rdblue I'll take a look at that test on Monday during one of my streams after I get something else done :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95167 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95167/testReport)** for PR 21977 at commit [`fcee94c`](https://github.com/apache/spark/commit/fcee94c22bebf1d55ba8ab9091b4bcb9852717b2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Awesome, thanks for taking the time @rdblue to dig into that. Really excited to get these limits in soon! (maybe this Friday during my weekly live code review time if it happens to line up :)).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94251/testReport)** for PR 21977 at commit [`a5a788b`](https://github.com/apache/spark/commit/a5a788b3dbb3c6285f903fda740fe3f69ab275a9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1705/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94066/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1698/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94191/testReport)** for PR 21977 at commit [`2fb3d79`](https://github.com/apache/spark/commit/2fb3d790dac74a8db6fffac7f34c89880b53e11f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @gatorsmile, I started [YarnPySparkSuite](https://gist.github.com/rdblue/9848a00f49eaad6126fbbcfa1b039e19), but the YARN tests don't create Python worker processes, so the tests don't work. I need to find out how to force YARN to create workers in order to write tests. If you have any input that would help, please let me know.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209702097
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            print(msg, file=sys.stderr)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    +                # convert to bytes
    +                total_memory_limit = memory_limit_mb * 1024 * 1024
    +
    +                msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +                print(msg, file=sys.stderr)
    +                resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
    --- End diff --
    
    This sets both the soft and hard limits to total_memory_limit.
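    
    As a standalone sketch of what that call does (the values here are illustrative, not from the PR): passing the same number twice sets the soft and hard limits together, and the soft limit is the one the allocator actually hits, which surfaces as `MemoryError` in Python.
    
    ```python
    import resource

    limit = 512 * 1024 * 1024  # hypothetical 512m, purely for illustration

    # The 2-tuple sets both the soft and the hard limit; an unprivileged process
    # can lower them again later but cannot raise the hard limit once lowered.
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    assert soft == hard == limit
    ```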


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Is this PR close to getting merged? Or do we have some problems that are hard to solve?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212757958
  
    --- Diff: docs/configuration.md ---
    @@ -179,6 +179,15 @@ of the most common options to set are:
         (e.g. <code>2g</code>, <code>8g</code>).
       </td>
     </tr>
    +<tr>
    + <td><code>spark.executor.pyspark.memory</code></td>
    +  <td>Not set</td>
    +  <td>
    +    The amount of memory to be allocated to PySpark in each executor, in MiB
    +    unless otherwise specified.  If set, PySpark memory for an executor will be
    +    limited to this amount. If not set, Spark will not limit Python's memory use.
    --- End diff --
    
    Maybe mention that in this case (unset), it's up to the user to keep Python plus system processes within the memory overhead percentage.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95225/testReport)** for PR 21977 at commit [`bb8fecb`](https://github.com/apache/spark/commit/bb8fecb19e2a231061f95a327a162e48df48e9cd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95313/testReport)** for PR 21977 at commit [`0b275cf`](https://github.com/apache/spark/commit/0b275cfea7d83cdf61802da30c4a7604be8900e4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94059/testReport)** for PR 21977 at commit [`3992cb2`](https://github.com/apache/spark/commit/3992cb2470f7fbb6c9e7fe5159287df12f9800ba).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95167/testReport)** for PR 21977 at commit [`fcee94c`](https://github.com/apache/spark/commit/fcee94c22bebf1d55ba8ab9091b4bcb9852717b2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209703555
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            print(msg, file=sys.stderr)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    --- End diff --
    
    Yes, I just hadn't updated this yet.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2496/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21977


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207414776
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -114,6 +114,10 @@ package object config {
         .checkValue(_ >= 0, "The off-heap memory size must not be negative")
         .createWithDefault(0)
     
    +  private[spark] val PYSPARK_EXECUTOR_MEMORY = ConfigBuilder("spark.executor.pyspark.memory")
    +      .bytesConf(ByteUnit.MiB)
    +      .createOptional
    --- End diff --
    
    tiny nit: indentation ..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94191/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94059/testReport)** for PR 21977 at commit [`3992cb2`](https://github.com/apache/spark/commit/3992cb2470f7fbb6c9e7fe5159287df12f9800ba).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1785/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207692117
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -51,6 +52,17 @@ private[spark] class PythonRDD(
       val bufferSize = conf.getInt("spark.buffer.size", 65536)
       val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
     
    +  val memoryMb = {
    --- End diff --
    
    I think there might be a misunderstanding about what `reuseWorker` means. The workers will be reused, but the decision on whether we fork in Python is based on whether we are on Windows. How about we both go and read the code path there and see if we reach the same understanding? I could be off too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @squito, I updated `YarnClusterSuite` in d58ad7a to capture the output of the child processes to find out what is causing the test failures. I think it is related to the commit you reviewed, 15fc2372269159ea2556b028d4eb8860c4108650, where the accumulator was updated. Here's the error:
    
    ```
      18/08/13 14:29:33 INFO SparkContext: Successfully stopped SparkContext                                                                                       
      Traceback (most recent call last):                                                                                        
        File "/home/blue-unencrypted/workspace/spark/resource-managers/yarn/target/tmp/spark-8370887e-9415-4f51-b08d-a6285e0384f5/test.py", line 11, in <module>
          sc = SparkContext(conf=SparkConf())                                                                                             
        File "/home/blue-unencrypted/workspace/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__                    
        File "/home/blue-unencrypted/workspace/spark/python/lib/pyspark.zip/pyspark/context.py", line 188, in _do_init                     
        File "/home/blue-unencrypted/workspace/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__          
        File "/home/blue-unencrypted/workspace/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value          
      py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonAccumulatorV2. Trace:                     
      py4j.Py4JException: Constructor org.apache.spark.api.python.PythonAccumulatorV2([class java.lang.String, class java.lang.Integer]) does not exist
            at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)                                                       
            at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)                                                                                                
            at py4j.Gateway.invoke(Gateway.java:237)                                                                                       
            at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)                                                                                                            
            at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)                                                    
            at py4j.GatewayConnection.run(GatewayConnection.java:238)                                                            
            at java.lang.Thread.run(Thread.java:748)
    ```
    
    Can you tell why that is failing after the changes I've made here? Is this a bug with this PR or is it an existing problem from that commit?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95108 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95108/testReport)** for PR 21977 at commit [`505f2eb`](https://github.com/apache/spark/commit/505f2eb09d60c695a80c7f62bde9a19a0e677357).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95165/testReport)** for PR 21977 at commit [`a38eac3`](https://github.com/apache/spark/commit/a38eac32e309a753a92ab2d74047b3cfd0382f56).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1703/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94386/testReport)** for PR 21977 at commit [`ee750ef`](https://github.com/apache/spark/commit/ee750efae806ea958a6a5a327799dafe6a0b3e64).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1784/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94071/testReport)** for PR 21977 at commit [`0d6deaf`](https://github.com/apache/spark/commit/0d6deafddebacf3aae04d2488f19bb3b2d6a500d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207635841
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -81,6 +82,17 @@ case class AggregateInPandasExec(
     
         val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
         val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val memoryMb = {
    --- End diff --
    
    The other configuration options are already duplicated, so I was trying to make as few changes as possible.
    
    Since there are several duplicated options, I think it makes more sense to pass the SparkConf through to PythonRunner so it can extract its own configuration.
    
    @holdenk, would you like this refactor done in this PR, or should I do it in a follow-up?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @holdenk, I attempted to write a YARN unit test for this, but evidently the MiniYARNCluster doesn't run python workers. The task is run, but a worker is never started. If you have any idea how to fix this, I think we could have an easy test. Here's what I have so far: https://gist.github.com/rdblue/9848a00f49eaad6126fbbcfa1b039e19


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    This is close. The Java and Scala tests were passing, and I think I fixed the remaining issue for the Python tests. Unfortunately, the Scala tests are failing again, so I was rerunning the tests a couple of times to see if the failure was unrelated.
    
    I'll rebase and make one more change that @vanzin suggested today, then hopefully we will get a good test run.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95225/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207601410
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -51,6 +52,17 @@ private[spark] class PythonRDD(
       val bufferSize = conf.getInt("spark.buffer.size", 65536)
       val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
     
    +  val memoryMb = {
    --- End diff --
    
    It's been a while since I spent a lot of time thinking about how we launch our Python worker processes. Maybe it would make sense to add a comment here explaining the logic a bit more? Based on the documentation in `PythonWorkerFactory`, it appears the fork/not-fork decision is based not on whether `reuseWorker` is set, but on whether we're on Windows. Is that the logic that this block was attempting to handle?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94065/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94703/testReport)** for PR 21977 at commit [`ac0da5b`](https://github.com/apache/spark/commit/ac0da5b6f37815ac9edf3575bc9a9149959bc6be).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @BryanCutler, thanks for taking a look at this.
    
    Despite the problems you hit when the limit was set too low, I think we do want to use that limit. It was the most reliable one from our testing because it successfully prevented YARN from killing the executor -- that's because it limits the entire address space, including libraries. I think we can mitigate the confusion when loading libraries by setting a reasonable default, maybe 1g.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94186 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94186/testReport)** for PR 21977 at commit [`68e9141`](https://github.com/apache/spark/commit/68e9141f14ccb68847ea30cda98c5e4f9b74ad52).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94076/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Posting this as a comment instead of in a thread so it doesn't get lost. In response to @holdenk's question about memory allocation to workers:
    
    That `useDaemon` flag controls whether worker.py is called directly from Spark or from a daemon process that forks new workers. The daemon process is used because [forking from a huge JVM is more expensive](https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L102-L104).
    
    This isn't related to REUSE_WORKER, which is [used by the forked worker process](https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/python/pyspark/daemon.py#L112). If reuse is set, then the process [keeps the worker alive](https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/python/pyspark/daemon.py#L173) and will call its main method again.
    
    But, it looks like my understanding of `spark.python.worker.reuse` is still incorrect given that configuration's description: "Reuse Python worker or not. If yes, it will use a fixed number of Python workers, does not need to fork() a Python process for every task." Sounds like reuse means that workers stay around for multiple tasks, not that concurrent tasks run in the same worker.
    
    I'll do some more digging to see what determines the number of workers. My guess is the number of cores, in which case this should always divide by the number of cores.
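    
    To make that guess concrete, a back-of-the-envelope sketch (the per-core split is an assumption here, not confirmed behavior, and the numbers are made up):
    
    ```python
    # Assumption: one forked worker per executor core, each receiving an equal
    # share of spark.executor.pyspark.memory as its RLIMIT_AS.
    pyspark_memory_mb = 4096   # hypothetical spark.executor.pyspark.memory
    executor_cores = 4         # hypothetical spark.executor.cores
    per_worker_limit_mb = pyspark_memory_mb // executor_cores
    print(per_worker_limit_mb)  # each worker would be limited to 1024 MiB
    ```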


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    it fails consistently for me locally too, with your branch, but with this failure:
    
    ```
    [info]     File "/Users/irashid/github/pub/spark/target/tmp/spark-7c0a388c-1413-4215-9a4d-c590edec929c/test.py", line 15, in <module>
    [info]       cnt = rdd.count()
    [info]     File "/Users/irashid/github/pub/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1075, in count
    [info]     File "/Users/irashid/github/pub/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1066, in sum
    [info]     File "/Users/irashid/github/pub/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 937, in fold
    [info]     File "/Users/irashid/github/pub/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 836, in collect
    [info]     File "/Users/irashid/github/pub/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    [info]     File "/Users/irashid/github/pub/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    [info]   py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    [info]   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.86.45, executor 1): org.apache.spark.SparkException: This RDD lacks a SparkContext.
    It could happen in the following cases:
    [info]   (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
    [info]   (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
    [info]          at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:90)
    [info]          at org.apache.spark.rdd.RDD.conf(RDD.scala:107)
    [info]          at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:61)
    [info]          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    [info]          at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    [info]          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    [info]          at org.apache.spark.scheduler.Task.run(Task.scala:128)
    [info]          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:367)
    [info]          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    [info]          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    [info]          at java.lang.Thread.run(Thread.java:745)
    ```
    
    the stack trace you pasted above is a little weird, because one of the lines it mentions doesn't seem right (though maybe you had some other debug code in there?): https://github.com/rdblue/spark/blob/SPARK-25004-add-python-memory-limit/python/pyspark/context.py?utf8=%E2%9C%93#L188


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95313/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208302473
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,26 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  private val memoryMb = {
    +    val allocation = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +    if (reuseWorker) {
    --- End diff --
    
    Thanks, Felix!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @rdblue Is this for YARN only?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    cc @jiangxb1987 @cloud-fan @jerryshao @vanzin 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94057 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94057/testReport)** for PR 21977 at commit [`8846bdc`](https://github.com/apache/spark/commit/8846bdc261670d7e9808229be6219445a6a427a6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    What about R? Do we also need a similar setting for it? I was thinking that with Project Hydrogen, more and more external processes (MPP) will be run inside Spark's executors, and all of these external processes require additional memory. Can we make this more general?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Why not use `resource.RLIMIT_RSS`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94057/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1823/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Does this have applications in the other cluster managers that account for overhead memory, like Kubernetes and Mesos?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213035238
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,6 +91,13 @@ private[spark] class Client(
       private val executorMemoryOverhead = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, MEMORY_OVERHEAD_MIN)).toInt
     
    +  private val isPython = sparkConf.get(IS_PYTHON_APP)
    --- End diff --
    
    @holdenk, can you point me to that repo? I'd love to have a look at how you do mixed pipelines.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207726903
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
       // Additional memory overhead.
       protected val memoryOverhead: Int = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)).toInt
    +  protected val pysparkWorkerMemory: Int = if (sparkConf.get(IS_PYTHON_APP)) {
    +    sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
    --- End diff --
    
    got it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @rdblue looking ...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95039/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212701416
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -114,6 +114,10 @@ package object config {
         .checkValue(_ >= 0, "The off-heap memory size must not be negative")
         .createWithDefault(0)
     
    +  private[spark] val PYSPARK_EXECUTOR_MEMORY = ConfigBuilder("spark.executor.pyspark.memory")
    --- End diff --
    
    Argh, should have noticed this before. Should this be added to `configuration.md`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207596709
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -333,7 +340,7 @@ private[spark] class Client(
         val maxMem = newAppResponse.getMaximumResourceCapability().getMemory()
         logInfo("Verifying our application has not requested more than the maximum " +
           s"memory capability of the cluster ($maxMem MB per container)")
    -    val executorMem = executorMemory + executorMemoryOverhead
    +    val executorMem = executorMemory + executorMemoryOverhead + pysparkWorkerMemory
         if (executorMem > maxMem) {
           throw new IllegalArgumentException(s"Required executor memory ($executorMemory" +
             s"+$executorMemoryOverhead MB) is above the max threshold ($maxMem MB) of this cluster! " +
    --- End diff --
    
    Maybe just switch it to use the total `$executorMem` instead?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95064/testReport)** for PR 21977 at commit [`505f2eb`](https://github.com/apache/spark/commit/505f2eb09d60c695a80c7f62bde9a19a0e677357).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @holdenk, what could be the cause?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209209726
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -137,13 +135,12 @@ case class AggregateInPandasExec(
     
           val columnarBatchIter = new ArrowPythonRunner(
             pyFuncs,
    -        bufferSize,
    -        reuseWorker,
             PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
             argOffsets,
             aggInputSchema,
             sessionLocalTimeZone,
    -        pythonRunnerConf).compute(projectedRowIter, context.partitionId(), context)
    +        pythonRunnerConf,
    +        sparkContext.conf).compute(projectedRowIter, context.partitionId(), context)
    --- End diff --
    
    Yea, same question.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208091782
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,26 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  private val memoryMb = {
    +    val allocation = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +    if (reuseWorker) {
    --- End diff --
    
    No, I'm not sure where that is. Is it on the python side? If you can point me to it, I'll have a closer look.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207726897
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            print(msg, file=sys.stderr)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    +                # convert to bytes
    +                total_memory_limit = memory_limit_mb * 1024 * 1024
    +
    +                msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +                print(msg, file=sys.stderr)
    +                resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
    +
    +        except (resource.error, OSError, ValueError) as e:
    +            # not all systems support resource limits, so warn instead of failing
    +            sys.stderr.write("WARN: Failed to set memory limit: {0}\n".format(e))
    --- End diff --
    
    ditto here
    `print(msg, file=sys.stderr)`?
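    
    i.e., the block would read something like this (a standalone sketch with an illustrative limit, using the `print(..., file=sys.stderr)` form):
    
    ```
    from __future__ import print_function  # worker.py supports both Python 2 and 3
    import resource
    import sys
    
    memory_limit_mb = 512  # illustrative; the worker reads PYSPARK_EXECUTOR_MEMORY_MB
    
    try:
        (soft_limit, hard_limit) = resource.getrlimit(resource.RLIMIT_AS)
        print("Current mem: {0} of max {1}".format(soft_limit, hard_limit), file=sys.stderr)
        if memory_limit_mb > 0 and soft_limit == resource.RLIM_INFINITY:
            new_limit = memory_limit_mb * 1024 * 1024  # convert MiB to bytes
            print("Setting mem to {0} of max {1}".format(new_limit, hard_limit), file=sys.stderr)
            resource.setrlimit(resource.RLIMIT_AS, (new_limit, new_limit))
    except (resource.error, OSError, ValueError) as e:
        # not all systems support resource limits, so warn instead of failing
        print("WARN: Failed to set memory limit: {0}".format(e), file=sys.stderr)
    ```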


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207716877
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -133,10 +133,17 @@ private[yarn] class YarnAllocator(
       // Additional memory overhead.
       protected val memoryOverhead: Int = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)).toInt
    +  protected val pysparkWorkerMemory: Int = if (sparkConf.get(IS_PYTHON_APP)) {
    +    sparkConf.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0)
    --- End diff --
    
    nit: default to -1 to be consistent?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94703/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94980/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209331503
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -137,13 +135,12 @@ case class AggregateInPandasExec(
     
           val columnarBatchIter = new ArrowPythonRunner(
             pyFuncs,
    -        bufferSize,
    -        reuseWorker,
             PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
             argOffsets,
             aggInputSchema,
             sessionLocalTimeZone,
    -        pythonRunnerConf).compute(projectedRowIter, context.partitionId(), context)
    +        pythonRunnerConf,
    +        sparkContext.conf).compute(projectedRowIter, context.partitionId(), context)
    --- End diff --
    
    No, we can't normally. The tests should fail, or we need another test for this part of the Arrow code. cc @BryanCutler (or I've misunderstood something). Whatever variables are needed from the conf need to be extracted outside of the `map` (while RDD also has a `conf`, it depends on the `sc`, which is only defined on the driver).
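    
    For illustration, a PySpark-side sketch of the same pitfall: the conf value has to be read on the driver and captured by the closure, not fetched inside the `map`.
    
    ```
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    sc = spark.sparkContext
    
    # read on the driver; the plain int is what gets shipped in the closure
    buffer_size = int(sc.getConf().get("spark.buffer.size", "65536"))
    
    rdd = sc.parallelize(range(10))
    print(rdd.map(lambda x: x % buffer_size).sum())  # fine: closure captures buffer_size
    
    # by contrast, touching sc (or anything that needs it) inside the lambda fails on
    # executors with the "This RDD lacks a SparkContext" / SPARK-5063 style error
    ```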


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2151/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    @holdenk, can you help review this since it is related to PySpark?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213407352
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -62,14 +63,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
         argOffsets: Array[Array[Int]])
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val conf = SparkEnv.get.conf
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    Sure, thanks for taking the time to clarify it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    cc @BryanCutler and @icexelloss too, since we recently discussed memory issues.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95165/testReport)** for PR 21977 at commit [`a38eac3`](https://github.com/apache/spark/commit/a38eac32e309a753a92ab2d74047b3cfd0382f56).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94750 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94750/testReport)** for PR 21977 at commit [`990b513`](https://github.com/apache/spark/commit/990b513fe968c94e03242c1fecfa8e9948d416ae).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94186/testReport)** for PR 21977 at commit [`68e9141`](https://github.com/apache/spark/commit/68e9141f14ccb68847ea30cda98c5e4f9b74ad52).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209330276
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,6 +91,13 @@ private[spark] class Client(
       private val executorMemoryOverhead = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, MEMORY_OVERHEAD_MIN)).toInt
     
    +  private val isPython = sparkConf.get(IS_PYTHON_APP)
    --- End diff --
    
    This is interesting. My one concern here is probably a little esoteric: for mixed-language pipelines this might not behave as desired. I'd suggest maybe a JIRA and a note in the config param that it only applies to Python apps, not mixed ones.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2502/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213187429
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,6 +91,13 @@ private[spark] class Client(
       private val executorMemoryOverhead = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, MEMORY_OVERHEAD_MIN)).toInt
     
    +  private val isPython = sparkConf.get(IS_PYTHON_APP)
    --- End diff --
    
    Sure, one of them is https://github.com/sparklingpandas/sparklingml


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207722434
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write(msg)
    --- End diff --
    
    Fixed. Thanks for the suggestion!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94076 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94076/testReport)** for PR 21977 at commit [`a70720b`](https://github.com/apache/spark/commit/a70720b275aa92d7b66281e942569d0d5d42068d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1782/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94703 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94703/testReport)** for PR 21977 at commit [`ac0da5b`](https://github.com/apache/spark/commit/ac0da5b6f37815ac9edf3575bc9a9149959bc6be).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94710/testReport)** for PR 21977 at commit [`d58ad7a`](https://github.com/apache/spark/commit/d58ad7ad361cfe7f108ec5b8940ffa88ffaedcf5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95165/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209823163
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    Eh, actually, I believe it uses 2-space indentation in general (https://spark.apache.org/contributing.html / https://github.com/databricks/scala-style-guide#spacing-and-indentation), and I am pretty sure 2 spaces are more common.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209707290
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    I think this is correct for a continuation line.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207630342
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -114,6 +114,10 @@ package object config {
         .checkValue(_ >= 0, "The off-heap memory size must not be negative")
         .createWithDefault(0)
     
    +  private[spark] val PYSPARK_EXECUTOR_MEMORY = ConfigBuilder("spark.executor.pyspark.memory")
    +      .bytesConf(ByteUnit.MiB)
    +      .createOptional
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94066/testReport)** for PR 21977 at commit [`c072dcb`](https://github.com/apache/spark/commit/c072dcb271ec0366be63e47dee82c9abcaaf0f17).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Yes, this is for YARN only. I've also opened follow-up issues for Mesos and Kubernetes integration.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94229/testReport)** for PR 21977 at commit [`0c0ff92`](https://github.com/apache/spark/commit/0c0ff92693945f3c5ae63f60af94b88281be0c32).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207597180
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -81,6 +82,17 @@ case class AggregateInPandasExec(
     
         val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
         val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    val memoryMb = {
    --- End diff --
    
    This is minor, but this code block is repeated; would it make sense to factor it out?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95039/testReport)** for PR 21977 at commit [`e16103b`](https://github.com/apache/spark/commit/e16103bb74df1e7444d20c59a8c687cc5531dd31).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    > We've found that python requires a lot less memory than it actually uses because it doesn't know when to GC
    
    yes, totally agree, sorry I wasn't clear in my initial comment -- overall I think this is a great idea!
    
    > If we made python memory a subset of overhead, then we would see a lot more people misconfiguring jobs that don't use python when they copy another job's settings. This way we can avoid requesting this memory if the job isn't PySpark. I also think it is more clear to allocate memory to the JVM, python, and overhead separately. That way executor memory and python executor memory are similar and you don't have to remember which one requires you to bump up overhead as well.
    
    While I agree with this to some extent, when users copy configs they already get memory horribly wrong; they really just need to understand what their job is doing. My concern is that the meaning of the overhead parameter becomes pretty confusing. It's (off-heap JVM) + (any external process), unless you have this new python conf set, in which case it's (off-heap JVM) + (any external process other than python), though YARN still monitors based on everything combined. Maybe that's unavoidable.
    
    So if users don't set this conf, the behavior is the same as before, right? And when they want to take advantage of it, they change their confs to just move memory from the overhead to the new conf? I think I'm OK with it then; I thought this was doing something else on first read.
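    
    e.g., something roughly like this (made-up numbers), shifting the python budget out of the overhead and into the new conf:
    
    ```
    from pyspark.sql import SparkSession
    
    # illustrative only: overhead now covers off-heap JVM use, python gets its own budget
    spark = (SparkSession.builder
             .config("spark.executor.memory", "8g")
             .config("spark.executor.memoryOverhead", "1g")
             .config("spark.executor.pyspark.memory", "2g")
             .getOrCreate())
    ```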
    
    More general brainstorming: I suppose there is no way to give python a hint to GC more often? This is sort of like moving from the UnifiedMemoryManager back to the static one, since now you put in a hard barrier. Seems worth it anyway; just thinking about what this means.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213121178
  
    --- Diff: docs/configuration.md ---
    @@ -179,6 +179,15 @@ of the most common options to set are:
         (e.g. <code>2g</code>, <code>8g</code>).
       </td>
     </tr>
    +<tr>
    + <td><code>spark.executor.pyspark.memory</code></td>
    +  <td>Not set</td>
    +  <td>
    +    The amount of memory to be allocated to PySpark in each executor, in MiB
    --- End diff --
    
    I've added "When PySpark is run in YARN, this memory is added to executor resource requests."


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2447/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r211752765
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,6 +91,13 @@ private[spark] class Client(
       private val executorMemoryOverhead = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, MEMORY_OVERHEAD_MIN)).toInt
     
    +  private val isPython = sparkConf.get(IS_PYTHON_APP)
    --- End diff --
    
    That's not really documented but as Holden says, it exists. Livy does that - but Livy actually goes ahead and sets the internal `spark.yarn.isPython` property, so it would actually take advantage of this code...
    
    Not sure how others do it, but all the ways I thought of to expose this as an option were pretty hacky, so I think it's OK to leave things like this for now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209703452
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,6 +91,13 @@ private[spark] class Client(
       private val executorMemoryOverhead = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, MEMORY_OVERHEAD_MIN)).toInt
     
    +  private val isPython = sparkConf.get(IS_PYTHON_APP)
    --- End diff --
    
    Is there documentation on how to create mixed-language pipelines? Clearly, all you need is a PythonRDD in your plan, but I thought it was non-trivial to create those from a Scala job.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rezasafi <gi...@git.apache.org>.
Github user rezasafi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209364194
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -137,13 +135,12 @@ case class AggregateInPandasExec(
     
           val columnarBatchIter = new ArrowPythonRunner(
             pyFuncs,
    -        bufferSize,
    -        reuseWorker,
             PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF,
             argOffsets,
             aggInputSchema,
             sessionLocalTimeZone,
    -        pythonRunnerConf).compute(projectedRowIter, context.partitionId(), context)
    +        pythonRunnerConf,
    +        sparkContext.conf).compute(projectedRowIter, context.partitionId(), context)
    --- End diff --
    
    The conf is accessible through `org.apache.spark.SparkEnv.get.conf` on the executors, AFAIK.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94710/testReport)** for PR 21977 at commit [`d58ad7a`](https://github.com/apache/spark/commit/d58ad7ad361cfe7f108ec5b8940ffa88ffaedcf5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94071/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r212759411
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,6 +91,13 @@ private[spark] class Client(
       private val executorMemoryOverhead = sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
         math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, MEMORY_OVERHEAD_MIN)).toInt
     
    +  private val isPython = sparkConf.get(IS_PYTHON_APP)
    --- End diff --
    
    Interesting, I'll add this to my example mixed pipeline repo so folks can see this hack.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Looks like the test problems were caused by accessing the SparkConf through either SparkContext or SparkSession on the executor side. The Scala tests are passing, and I've fixed a couple more references to the conf in Arrow code that should fix the remaining Python failures.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2181/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r211763465
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
    --- End diff --
    
    You're right, we could do that. Originally, I thought it was a good idea to pass in the right config, but that's not possible. If we use the config, we must use `SparkEnv.get.conf`. The alternative is to go back to spreading the config logic into every RDD or Exec node to use the conf on the driver and serialize it to make it available on the executors, which is ugly.
    
    I'd prefer just using SparkEnv.get.conf in PythonRunner.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94063/testReport)** for PR 21977 at commit [`8f993ba`](https://github.com/apache/spark/commit/8f993ba93453c5fbb565853ade26a372556bc04b).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94076 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94076/testReport)** for PR 21977 at commit [`a70720b`](https://github.com/apache/spark/commit/a70720b275aa92d7b66281e942569d0d5d42068d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r210317086
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala ---
    @@ -69,7 +67,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi
     
         // Output iterator for results from Python.
         val outputIterator = new PythonUDFRunner(
    -        funcs, bufferSize, reuseWorker, PythonEvalType.SQL_BATCHED_UDF, argOffsets)
    +        funcs, PythonEvalType.SQL_BATCHED_UDF, argOffsets, sparkContext.conf)
    --- End diff --
    
    Good catch, I'll fix it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    cc @ueshin 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1699/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207629997
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write()
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207722412
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write(msg)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    +                # convert to bytes
    +                total_memory_limit = memory_limit_mb * 1024 * 1024
    +
    +                msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +                sys.stderr.write(msg)
    +                resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
    +
    +        except (resource.error, OSError) as e:
    +            # not all systems support resource limits, so warn instead of failing
    +            sys.stderr.write("WARN: Failed to set memory limit: {0}\n".format(e))
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207637946
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write()
    +
    +            if memory_limit_mb > 0 and total_memory_limit < 0:
    --- End diff --
    
    I've updated to use `resource.RLIM_INFINITY`.
    
    I think this should only set the resource limit if it isn't already set. It is unlikely that it's already set because this is during worker initialization, but the intent is to avoid causing harm if a higher-level system (e.g. a container provider) has already set the limit.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94063/testReport)** for PR 21977 at commit [`8f993ba`](https://github.com/apache/spark/commit/8f993ba93453c5fbb565853ade26a372556bc04b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94804/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94750/testReport)** for PR 21977 at commit [`990b513`](https://github.com/apache/spark/commit/990b513fe968c94e03242c1fecfa8e9948d416ae).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r209992878
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -60,14 +61,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
    -    argOffsets: Array[Array[Int]])
    +    argOffsets: Array[Array[Int]],
    +    conf: SparkConf)
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    The Spark docs say to use 2 spaces for an indent, which this does. This also uses 2 indents for continuation lines. Continuation lines aren't covered in the Spark docs other than for lines with function parameters -- where 2 indents are required -- but it is fairly common to do this. I've seen both in Spark code.
    
    I don't think that the Databricks style guide applies to Apache projects.
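
    Style aside, a worked example of the per-worker split in the quoted snippet may help readers; the numbers are hypothetical:

    ```
    // Sketch of the same per-worker division with hypothetical values:
    // spark.executor.pyspark.memory = 4096 MB, spark.executor.cores = 4.
    val pysparkMemoryMb: Option[Long] = Some(4096L)
    val executorCores = 4
    val memoryMbPerWorker = pysparkMemoryMb.map(_ / executorCores)  // Some(1024)
    ```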


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r211748871
  
    --- Diff: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
    @@ -161,6 +162,11 @@ abstract class BaseYarnClusterSuite
         }
         extraJars.foreach(launcher.addJar)
     
    +    if (outFile.isDefined) {
    --- End diff --
    
    `outFile.foreach`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95167/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95175/testReport)** for PR 21977 at commit [`fcee94c`](https://github.com/apache/spark/commit/fcee94c22bebf1d55ba8ab9091b4bcb9852717b2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95050/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r208339172
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write()
    +
    +            if memory_limit_mb > 0 and total_memory_limit < 0:
    --- End diff --
    
    Works for me. I'll update this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213186832
  
    --- Diff: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
    @@ -161,6 +162,11 @@ abstract class BaseYarnClusterSuite
         }
         extraJars.foreach(launcher.addJar)
     
    +    if (outFile.isDefined) {
    --- End diff --
    
    I think the pattern match would be better than the get.
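
    For comparison, a self-contained sketch of the three shapes being discussed; the Option value and the `useOutput` helper are hypothetical stand-ins, since the diff only shows the condition:

    ```
    import java.io.File

    // Hypothetical stand-ins for the launcher call in the test suite.
    val outFile: Option[File] = Some(new File("stderr.log"))
    def useOutput(f: File): Unit = println(s"redirecting to $f")

    // 1) Form in the diff: explicit isDefined check plus .get inside the block.
    if (outFile.isDefined) {
      useOutput(outFile.get)
    }

    // 2) vanzin's suggestion: foreach over the Option.
    outFile.foreach(useOutput)

    // 3) Pattern match, which also avoids calling .get.
    outFile match {
      case Some(file) => useOutput(file)
      case None => // nothing to do
    }
    ```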


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94054/testReport)** for PR 21977 at commit [`19cd9c5`](https://github.com/apache/spark/commit/19cd9c5cce4420729074a0976b129889d70fd56c).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2539/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Seems okay.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95108 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95108/testReport)** for PR 21977 at commit [`505f2eb`](https://github.com/apache/spark/commit/505f2eb09d60c695a80c7f62bde9a19a0e677357).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1813/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    build error
    ```
    [error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala:88: method sparkContext in class SparkPlan cannot be accessed in org.apache.spark.sql.execution.SparkPlan
    [error]  Access to protected method sparkContext not permitted because
    [error]  prefix type org.apache.spark.sql.execution.SparkPlan does not conform to
    [error]  class ArrowEvalPythonExec in package python where the access take place
    [error]       child.sparkContext.conf).compute(batchIter, context.partitionId(), context)
    [error]             ^
    ```
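
    For reference, that failure is Scala's protected-access rule: `SparkPlan.sparkContext` can be referenced by a subclass on itself, but not through another `SparkPlan` instance such as `child`. A minimal standalone illustration (names are not Spark's):

    ```
    class Plan {
      protected def ctx: String = "conf"
    }

    class EvalPlan(child: Plan) extends Plan {
      def ok: String = ctx            // fine: accessed on `this`
      // def bad: String = child.ctx  // does not compile: prefix type Plan does not conform to EvalPlan
    }
    ```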


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207716854
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            sys.stderr.write(msg)
    --- End diff --
    
    seems like the pattern `print(msg, file=sys.stderr)` is used elsewhere in this file


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Does this work by setting some container configs? Maybe we can apply this to k8s later, cc @liyinan926


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #95039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95039/testReport)** for PR 21977 at commit [`e16103b`](https://github.com/apache/spark/commit/e16103bb74df1e7444d20c59a8c687cc5531dd31).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2389/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213882447
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -62,14 +63,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
         argOffsets: Array[Array[Int]])
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val conf = SparkEnv.get.conf
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    Oh, it's fine. I meant to fix them together if there are more changes to push. Not a big deal.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r210274748
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala ---
    @@ -69,7 +67,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chi
     
         // Output iterator for results from Python.
         val outputIterator = new PythonUDFRunner(
    -        funcs, bufferSize, reuseWorker, PythonEvalType.SQL_BATCHED_UDF, argOffsets)
    +        funcs, PythonEvalType.SQL_BATCHED_UDF, argOffsets, sparkContext.conf)
    --- End diff --
    
    I _think_ this code is executed on the workers, which means you can't access the SparkContext here; that could explain the local test failures.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r213777162
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -62,14 +63,20 @@ private[spark] object PythonEvalType {
      */
     private[spark] abstract class BasePythonRunner[IN, OUT](
         funcs: Seq[ChainedPythonFunctions],
    -    bufferSize: Int,
    -    reuseWorker: Boolean,
         evalType: Int,
         argOffsets: Array[Array[Int]])
       extends Logging {
     
       require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
     
    +  private val conf = SparkEnv.get.conf
    +  private val bufferSize = conf.getInt("spark.buffer.size", 65536)
    +  private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
    +  // each python worker gets an equal part of the allocation. the worker pool will grow to the
    +  // number of concurrent tasks, which is determined by the number of cores in this executor.
    +  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
    +      .map(_ / conf.getInt("spark.executor.cores", 1))
    --- End diff --
    
    @HyukjinKwon, sorry but it looks like this was merged before I could push a commit to update it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21977: SPARK-25004: Add spark.executor.pyspark.memory li...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21977#discussion_r207747975
  
    --- Diff: python/pyspark/worker.py ---
    @@ -259,6 +260,26 @@ def main(infile, outfile):
                                  "PYSPARK_DRIVER_PYTHON are correctly set.") %
                                 ("%d.%d" % sys.version_info[:2], version))
     
    +        # set up memory limits
    +        memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
    +        total_memory = resource.RLIMIT_AS
    +        try:
    +            (total_memory_limit, max_total_memory) = resource.getrlimit(total_memory)
    +            msg = "Current mem: {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +            print(msg, file=sys.stderr)
    +
    +            if memory_limit_mb > 0 and total_memory_limit == resource.RLIM_INFINITY:
    +                # convert to bytes
    +                total_memory_limit = memory_limit_mb * 1024 * 1024
    +
    +                msg = "Setting mem to {0} of max {1}\n".format(total_memory_limit, max_total_memory)
    +                print(msg, file=sys.stderr)
    +                resource.setrlimit(total_memory, (total_memory_limit, total_memory_limit))
    +
    +        except (resource.error, OSError, ValueError) as e:
    +            # not all systems support resource limits, so warn instead of failing
    +            sys.stderr.write("WARN: Failed to set memory limit: {0}\n".format(e))
    --- End diff --
    
    Fixed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94230/testReport)** for PR 21977 at commit [`3616f60`](https://github.com/apache/spark/commit/3616f607d3a20a6f4a27e63ee42642ef1ccc0366).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Actually, looking at the Jenkins failures, this could be the cause of some of them as well (certainly the https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94750/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_client_mode/ failure)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94189/testReport)** for PR 21977 at commit [`f427120`](https://github.com/apache/spark/commit/f42712055a017f270b58576acd3efa3272d90767).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1697/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94054/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94229/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21977
  
    **[Test build #94065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94065/testReport)** for PR 21977 at commit [`121d3b5`](https://github.com/apache/spark/commit/121d3b51f6cf06906e188dc7fd07976bbca1bb6a).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org