Posted to reviews@spark.apache.org by ryan-williams <gi...@git.apache.org> on 2014/12/01 05:45:37 UTC

[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

GitHub user ryan-williams opened a pull request:

    https://github.com/apache/spark/pull/3525

    Improve YarnAllocator's parsing of "memory overhead" param

    * let it be specified as a fraction of the executor memory
    * add/generalize some utilities for parsing "memory strings" that are found around the Spark codebase.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ryan-williams/spark mem-overhead

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3525
    
----
commit 849c8bc129a8f12290b896085bce3f9feb8b041b
Author: Ryan Williams <ry...@gmail.com>
Date:   2014-11-13T02:32:34Z

    let yarn mem overhead be specified as a fraction
    
    .. of total executor memory. Several people have alluded to the ratio
    of the executor.memoryOverhead to executor.memory as being more
    important than the absolute amount of executor.memoryOverhead, so this
    commit allows for specifying the former, and takes the maximum of the
    computed overheads implied by the fractional and absolute (# of mb)
    conf variables.
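
    For illustration, a minimal sketch of that max logic (hypothetical names,
    not the PR's exact code):
    
    ```
    // Sketch only: take the larger of the absolute and fractional overheads,
    // keeping the existing 384 MB floor.
    def overheadMb(executorMemoryMb: Int, absoluteOverheadMb: Int, overheadFraction: Double): Int =
      math.max(384, math.max(absoluteOverheadMb, (overheadFraction * executorMemoryMb).toInt))
    ```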

commit 5d2327a5dd6b479149b01b8a7eddbbbf94d62aa2
Author: Ryan Williams <ry...@gmail.com>
Date:   2014-11-13T02:52:50Z

    memoryStringToMb can have default scale specified
    
    Previously it assumed a unitless number represented raw bytes, but I
    want to use it for a config variable that previously defaulted to # of
    megabytes, without breaking backwards-compatibility.
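
    A few hedged examples of the intended behavior, using the
    `parseMemoryString` signature quoted in the review comments below:
    
    ```
    Utils.parseMemoryString("512", defaultInputScaleChar = 'm')  // 512, i.e. megabytes, as before
    Utils.parseMemoryString("512m")                              // 512 (an explicit suffix wins)
    Utils.parseMemoryString("1g")                                // 1024 (default output scale is 'm')
    ```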

commit 928831e588bf0a1073ea73a34d47a8ecd4b931f8
Author: Ryan Williams <ry...@gmail.com>
Date:   2014-11-13T02:53:01Z

    remove unused import

commit fe8a348cb5e7a50253351193470042694309f4c1
Author: Ryan Williams <ry...@gmail.com>
Date:   2014-11-13T04:00:19Z

    generalize Utils.memoryStringToMb to b/k/m/g scale
    
    - one existing usage in Utils converts a memory string to bytes, so I
      folded that into the generalized `parseMemoryString` function that
      `memoryStringToMb` is also now a wrapper for.
    - a usage in YarnAllocator (see following commit) needs some special
      handling around the unit-less case, further motivating this
      abstraction
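
    A plausible shape for that wrapper, inferred from the commit message
    rather than copied from the PR:
    
    ```
    // Sketch: memoryStringToMb delegates to the generalized parser, keeping
    // the old unitless-means-bytes behavior.
    def memoryStringToMb(str: String): Int =
      parseMemoryString(str, defaultInputScaleChar = 'b', outputScaleChar = 'm').toInt
    ```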

commit 5b361397dc0bcdd478a7328029bc51d22aba3fd0
Author: Ryan Williams <ry...@gmail.com>
Date:   2014-11-13T04:00:57Z

    executor.memoryOverhead takes a “memory string”
    
    previously just assumed megabytes
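
    Assuming the default input scale for this property stays 'm' (per the
    backwards-compatibility commit above), these two settings would be
    equivalent; a sketch, not the PR's test code:
    
    ```
    val conf = new org.apache.spark.SparkConf()
    conf.set("spark.yarn.executor.memoryOverhead", "1024")  // unitless: megabytes, as before
    conf.set("spark.yarn.executor.memoryOverhead", "1g")    // new: explicit unit suffix
    ```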

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24450496
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -980,33 +980,77 @@ private[spark] object Utils extends Logging {
         )
       }
     
    +
    +  private val TB = 1L << 40
    +  private val GB = 1L << 30
    +  private val MB = 1L << 20
    +  private val KB = 1L << 10
    +
    +  private val scaleCharToFactor: Map[Char, Long] = Map(
    +    'b' -> 1L,
    +    'k' -> KB,
    +    'm' -> MB,
    +    'g' -> GB,
    +    't' -> TB
    +  )
    +
       /**
    -   * Convert a Java memory parameter passed to -Xmx (such as 300m or 1g) to a number of megabytes.
    -   */
    -  def memoryStringToMb(str: String): Int = {
    +   * Convert a Java memory parameter passed to -Xmx (such as "300m" or "1g") to a number of
    +   * megabytes (or other byte-scale denominations as specified by @outputScaleChar).
    +   *
    +   * For @defaultInputScaleChar and @outputScaleChar, valid values are: 'b' (bytes), 'k'
    +   * (kilobytes), 'm' (megabytes), 'g' (gigabytes), and 't' (terabytes).
    +   *
    +   * @param str String to parse an amount of memory out of
    +   * @param defaultInputScaleChar if no "scale" is provided on the end of @str (i.e. @str is a
    +   *                              plain numeric value), assume this scale (default: 'b' for
    +   *                              'bytes')
    +   * @param outputScaleChar express the output in this scale, i.e. number of bytes, kilobytes,
    +   *                        megabytes, or gigabytes.
    +   */
    +  def parseMemoryString(
    +      str: String,
    +      defaultInputScaleChar: Char = 'b',
    +      outputScaleChar: Char = 'm'): Long = {
    +
         val lower = str.toLowerCase
    -    if (lower.endsWith("k")) {
    -      (lower.substring(0, lower.length-1).toLong / 1024).toInt
    -    } else if (lower.endsWith("m")) {
    -      lower.substring(0, lower.length-1).toInt
    -    } else if (lower.endsWith("g")) {
    -      lower.substring(0, lower.length-1).toInt * 1024
    -    } else if (lower.endsWith("t")) {
    -      lower.substring(0, lower.length-1).toInt * 1024 * 1024
    -    } else {// no suffix, so it's just a number in bytes
    -      (lower.toLong / 1024 / 1024).toInt
    -    }
    +    val lastChar = lower(lower.length - 1)
    +    val (num, inputScaleChar) =
    +      if (lastChar.isDigit) {
    +        (lower.toLong, defaultInputScaleChar)
    +      } else {
    +        (lower.substring(0, lower.length - 1).toLong, lastChar)
    +      }
    +
    +    (for {
    --- End diff --
    
    Yeah I get what it does. I was comparing in my mind to...
    
    ```
    val inputScale = scaleCharToFactor.get(inputScaleChar)
    val outputScale = scaleCharToFactor.get(outputScaleChar)
    require(inputScale.isDefined)
    require(outputScale.isDefined)
    inputScale.get * num / outputScale.get
    ```
    
    Here's another instance where I wouldn't mind hearing another opinion, as it's a good, more general style question.





[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73589151
  
    Thanks for resurrecting this, @srowen. I think there are 3 potentially separable changes here, in rough order of increasing controversial-ness:
    
    1. Make `spark.yarn.executor.memoryOverhead` take a "memory string" (e.g. `1g` or `100m`)
        * defaulting to `m` as the unit for backwards-compatibility.
    2. General refactoring of "memory string" handling in conf parameters in some other parts of the codebase.
    3. Addition of `spark.yarn.executor.memoryOverheadFraction`
    
    If people prefer, I can make a separate PR with 1∪2 or individual PRs for each of 1 and 2.
    
    Re: 3, I am sympathetic to @tgravescs's arguments, but worry that we are conflating [wishing Spark's configuration surface area was smaller] with [forcing users to build custom Sparks to get at real, existing niches of said configuration surface area (in this case to change a hard-coded `.07` that reasonable people can disagree on the ideal value of)]. Additionally, the arguments about possible confusion around interplay between `memoryOverhead` and `memoryOverheadFraction` are applicable to the status quo, imho.
    
    However, I don't mind much one way or another at this point, so I'm fine dropping 3 if that's what consensus here prefers.
    





[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24446286
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    NB: that was an answer to why this property is special-cased here, as opposed to over in `Utils`. You may be more interested in the question of why it's special-cased at all, but that seems reasonable to me given the couple of magic strings and its greater-than-1 call-sites (namely: 2).




[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65022927
  
    there are actually two JIRAs I'll file here:
    * add param to set fractional overhead memory amount
    * make the existing param a "memory string", not just a # of megabytes.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by lianhuiwang <gi...@git.apache.org>.
Github user lianhuiwang commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73535044
  
    Earlier I thought the two configs could coexist. After @tgravescs's points, I think the absolute overhead is more necessary than the overheadFraction: when executor memory is very large, we just use the absolute overhead to increase it, and a fraction-based memoryOverheadFraction would be very large and inappropriate.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24633577
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    hm, I'd vote we put it back in `Utils` over un-factoring those two calls that are doing the same thing as one another




[GitHub] spark pull request: [SPARK-4665] Improve YarnAllocator's parsing o...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65071045
  
    Perhaps I've missed it, but I haven't heard a lot of cases for either way.  Do you have examples or use cases?  I'd be open to changing it but want more reasoning behind it.  I've found putting in the value rather than a % easier in some cases that weren't small, straightforward jobs.




[GitHub] spark pull request: [SPARK-4665] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65098147
  
    @arahuja recently had YARN killing jobs until he bumped memory-overhead to 4-5GB, on executor memory of 20-30GB, so the hard-coded 7% was not enough for him. In general, this fraction should be configurable; maybe some people want <7% too! 7% is not special, afaik.
    
    @sryza has given me the impression that the overhead tends to grow proportionally to the executor memory, meaning that allowing people to configure the *fraction* makes as much or more sense than having them do some division and tweak their cmd-line flag for every job in order to specify an absolute amount of memory.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73388468
  
    FWIW I think the logic behind the existing parameters is: 7% should be enough for anyone. And if it isn't, they know it, and have a clear idea of just how much they want the overhead to be. Or: if 7% is consistently not enough, it should be higher, like 10%. I'd leave the existing param unless it's obviously problematic.
    
    But that's separate from the question here, of whether lots of these strings could be specified in megabytes or gigabytes. Is that still something that can move forward? The refactoring of all that seems helpful in any event.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73429227
  
    OK maybe there's enough momentum for ... two flags? one of which overrides the other? But maybe we should also bump up the default a bit too.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73523911
  
    I'm still not on board with having 2 different configs.  You then have to explain to the user what they are, which one takes precedence, etc. I can see cases where the % is useful, but then again you can argue the other way too.  If my job needs 4G overhead, I may want to increase the executor memory without changing the overhead amount.  In that case I then have to go mess with the % to get it back to what I want.  Either way I think the user has to experiment to get the best number.  I haven't seen any reason to bump up the default at this point, but if others have more real-world data let's consider it.
    
    How many people are complaining or having real issues with this?  
    
    The main reason I'm against adding this is that I consider it an API change, and we now have to support 2 configs until we can get rid of the deprecated one.  It's just more dev overhead, testing, and potential user confusion.  If it's enough of a benefit to the user to change it then I would be OK with it, but I would rather deprecate the existing one in favor of the %.  To me the raw number is more flexible than the %, but I see the % as being easier in some situations.
    





[GitHub] spark pull request: [SPARK-4665] Improve YarnAllocator's parsing o...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65132142
  
    Basically I would like some hard data or user input before changing this, since changing/adding configs can just lead to more confusion.  I have seen a few cases where a % wasn't ideal. I've seen others where 7% worked just fine.  Overall I'd like to understand if the % covers the majority of cases, or if, say, it is only useful doing ETL but then breaks down doing ML or GraphX.  We had this discussion briefly before and left it at value vs. %, and I haven't really seen much more data to change it, but perhaps others have?




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73674272
  
    I personally definitely like the clarity this is adding in the code. Looks like it needs another rebase and test to see where we stand with Jenkins and Mima.




[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65020451
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-4665] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65101456
  
    @arahuja also mentioned that his issue (needing seemingly excessive amounts of memory overhead) was fixed by #3465, but I think that still supports the idea that sometimes people need the ability to tweak  this parameter on a case-by-case basis.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73783538
  
    OK I think it is back to you @srowen 




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24448450
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    Option 3. could be: Put such methods in a `SparkConfUtils` object that would be less prone to kitchen-sink-iness.
    
    I don't feel strongly either way; I'll let you make the call among these three.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24400791
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala ---
    @@ -20,6 +20,7 @@ package org.apache.spark.scheduler.local
     import java.nio.ByteBuffer
     
     import scala.concurrent.duration._
    +import scala.language.postfixOps
    --- End diff --
    
    Yes that's the right thing, if you think that the use of postfix ops is justified.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73594574
  
    Without clear consensus at this point, and given 3 is really a question of how you like to override a default in the same way you override others, I'd do 1+2 for the moment.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24449997
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -980,33 +980,77 @@ private[spark] object Utils extends Logging {
         )
       }
     
    +
    +  private val TB = 1L << 40
    +  private val GB = 1L << 30
    +  private val MB = 1L << 20
    +  private val KB = 1L << 10
    +
    +  private val scaleCharToFactor: Map[Char, Long] = Map(
    +    'b' -> 1L,
    +    'k' -> KB,
    +    'm' -> MB,
    +    'g' -> GB,
    +    't' -> TB
    +  )
    +
       /**
    -   * Convert a Java memory parameter passed to -Xmx (such as 300m or 1g) to a number of megabytes.
    -   */
    -  def memoryStringToMb(str: String): Int = {
    +   * Convert a Java memory parameter passed to -Xmx (such as "300m" or "1g") to a number of
    +   * megabytes (or other byte-scale denominations as specified by @outputScaleChar).
    +   *
    +   * For @defaultInputScaleChar and @outputScaleChar, valid values are: 'b' (bytes), 'k'
    +   * (kilobytes), 'm' (megabytes), 'g' (gigabytes), and 't' (terabytes).
    +   *
    +   * @param str String to parse an amount of memory out of
    +   * @param defaultInputScaleChar if no "scale" is provided on the end of @str (i.e. @str is a
    +   *                              plain numeric value), assume this scale (default: 'b' for
    +   *                              'bytes')
    +   * @param outputScaleChar express the output in this scale, i.e. number of bytes, kilobytes,
    +   *                        megabytes, or gigabytes.
    +   */
    +  def parseMemoryString(
    +      str: String,
    +      defaultInputScaleChar: Char = 'b',
    +      outputScaleChar: Char = 'm'): Long = {
    +
         val lower = str.toLowerCase
    -    if (lower.endsWith("k")) {
    -      (lower.substring(0, lower.length-1).toLong / 1024).toInt
    -    } else if (lower.endsWith("m")) {
    -      lower.substring(0, lower.length-1).toInt
    -    } else if (lower.endsWith("g")) {
    -      lower.substring(0, lower.length-1).toInt * 1024
    -    } else if (lower.endsWith("t")) {
    -      lower.substring(0, lower.length-1).toInt * 1024 * 1024
    -    } else {// no suffix, so it's just a number in bytes
    -      (lower.toLong / 1024 / 1024).toInt
    -    }
    +    val lastChar = lower(lower.length - 1)
    +    val (num, inputScaleChar) =
    +      if (lastChar.isDigit) {
    +        (lower.toLong, defaultInputScaleChar)
    +      } else {
    +        (lower.substring(0, lower.length - 1).toLong, lastChar)
    +      }
    +
    +    (for {
    --- End diff --
    
    > Why is a for construction used here? just to handle the invalid in / out scale param?
    
    `for` comprehensions are commonly used when `map`ping over 2 or more monads to avoid arguably-ugly `.flatMap`-chaining; the syntax just removes some boilerplate, e.g.:
    
    ```
        scaleCharToFactor.get(inputScaleChar).flatMap(inputScale => 
          scaleCharToFactor.get(outputScaleChar).map(outputScale =>
            inputScale * num / outputScale
          )
        ).getOrElse(
            // throw
          )
    ```
    vs.
    
    ```
        (for {
          inputScale <- scaleCharToFactor.get(inputScaleChar)
          outputScale <- scaleCharToFactor.get(outputScaleChar)
        } yield {
          inputScale * num / outputScale
        }).getOrElse(
            // throw
          )
    ```
    
    (I collapsed the `scale` wrapper line in the latter for apples-to-apples brevity, and can do that in the PR as well).
    
    So, it's not a "looping construct" so much as a "mapping" one, commonly used on `Option`s, `List`s, and [even things like Twitter `Future`s](https://twitter.github.io/scala_school/finagle.html) (search for "for {"). 
    
    However, it does get better as the number of chained `map`s increases, e.g. especially when there are 3 or more, so I'm not that tied to it here.
    
    Of course, handling such situations using a `match` is also possible:
    
    ```
        (scaleCharToFactor.get(inputScaleChar), scaleCharToFactor.get(outputScaleChar)) match {
          case (Some(inputScale), Some(outputScale)) =>
            inputScale * num / outputScale
          case _ =>
            // throw
        }
    ```
    
    I prefer all of these to, say, the following straw-man that doesn't take advantage of any of the nice things that come from using `Option`s:
    
    ```
        if (scaleCharToFactor.get(inputScaleChar).isDefined && 
            scaleCharToFactor.get(outputScaleChar).isDefined)
          scaleCharToFactor.get(inputScaleChar).get * num / scaleCharToFactor.get(outputScaleChar).get
        else
          // throw
    ```
    
    but I'll defer to you on the approach you like best.





[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73427026
  
    Not a huge deal, but I think the converse is kind of true.  In my experience, the right percentage is totally unclear to everyone in all situations, so it pretty much always needs to be set experimentally.  If I settle on a percentage that works for my job (or all my jobs), it would be nice for me not to need to reset this param every time I tweak spark.executor.memory.




[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65020605
  
    possibly relevant to @sryza




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24401098
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -113,9 +113,9 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
     </tr>
     <tr>
      <td><code>spark.yarn.executor.memoryOverhead</code></td>
    -  <td>executorMemory * 0.07, with minimum of 384 </td>
    +  <td>executorMemory * 0.07, with minimum of 384 megabytes </td>
    --- End diff --
    
    Indeed, and should the other properties here get the same treatment, in docs and code?




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24450366
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -113,9 +113,9 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
     </tr>
     <tr>
      <td><code>spark.yarn.executor.memoryOverhead</code></td>
    -  <td>executorMemory * 0.07, with minimum of 384 </td>
    +  <td>executorMemory * 0.07, with minimum of 384 megabytes </td>
    --- End diff --
    
    good catch, updated the other similar strings in this file




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24448190
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    I would have picked `Utils` I suppose. Or is there somewhere less generic to put this that covers the 2 call sites? Other opinions?




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24446161
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    I weighed two options:
    
    1. have a method in `Utils` that takes a `SparkConf` as a parameter and thinly wraps a method call on said `SparkConf`
    2. make the aforementioned wrapper a method on `SparkConf` that delegates to another method on `SparkConf`
    
    ..and felt like the latter was better/cleaner. My feeling was that a kitchen-sink / generically-named `Utils` class that wraps methods for `SparkConf` (and possibly other classes?) to maintain an illusion of simplicity in the `SparkConf` API is not helping code clarity.
    
    Of course, this is subjective and I'm open to putting it back in `Utils`, lmk.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73625367
  
    OK, this is ready to go again. Changes:
    * removed SPARK-4665 from the title and closed that JIRA
    * in the process of rebasing this change I found myself wanting "MB" suffixes on various `Int` variables that represent numbers of megabytes; I've included several commits here that perform such renames, but they're separate and easy to remove from this PR if that's controversial
    * I made `Utils.getMaxResultSize(SparkConf)` a method of `SparkConf` instead.
    * I cleaned up the semantics around which `Utils` and `SparkConf` methods assume `Int`s to represent numbers of megabytes, vs. ones that are generic across memory-size orders-of-magnitude.
    
    Let me know what you think.




[GitHub] spark pull request: [SPARK-4665] Improve YarnAllocator's parsing o...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65023669
  
    @ryan-williams your reasoning makes sense to me.  @andrewor14 @tgravescs what do you think about deprecating memoryOverhead and adding memoryOverheadFraction?





[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65022899
  
    sure, will file JIRA.
    
    To answer what I think is the spirit of your question: whether a Spark user wants more or less than 7% seems like something that should be configurable. I have definitely wanted to set this value closer to 10% of the executor memory, or higher, to make sure that wasn't my problem when debugging some of my own Spark jobs. I can't swear that I've seen YARN kill my executors for exceeding the 7%, but I've definitely felt like I wanted to configure it.
    
    I can see that it's unusual to have two params that both affect this value, but it seems like the right tradeoff: the fraction more closely models what we want, but by an accident of history we have this other value, and we don't want to break backwards-compatibility.
    
    Everyone I've heard talk about how to think about the overhead (which is mostly you, admittedly) talks about it in terms of the fraction of the executor memory. If you think the best course is to deprecate the absolute-amount-of-memory one as well, I'm open to that, but I don't think the status quo (1. have a hard-coded value for the logical param that we care about, and 2. have a param that sets the value we want indirectly) makes much sense.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by lianhuiwang <gi...@git.apache.org>.
Github user lianhuiwang commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73520398
  
    memoryOverheadFraction sounds good to me.




[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65021562
  
    In Spark 1.2, the memory overhead defaults to 7% of the executor memory.  Have you noticed that you need a larger fraction than this?  In the change that added that fraction, there was some concern about having two different params (a constant overhead and a fraction) to control the same value.




[GitHub] spark pull request: Improve YarnAllocator's parsing of "memory ove...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65021214
  
    Hey @ryan-williams , mind filing a JIRA for this?




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24632812
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    I'd like another set of eyes on the change from @pwendell or @JoshRosen. The reason I hesitate before merging is the large number of small changes and merge conflict potential. Although I do definitely think the variable names are improved by the suffix.
    
    For this particular issue, maybe expose just `getMemory` from this class, and inline the two simple calls to it that currently use `getMaxResultSize`? Writing `getMemory("spark.driver.maxResultSize", "1g", outputScale = 'b')` in two places isn't bad versus constructing a new home for it.
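
    For concreteness, a minimal sketch of that shape (hypothetical code,
    assuming this PR's `parseMemoryString`):
    
    ```
    // Sketch: a generic memory getter on SparkConf; the two former
    // getMaxResultSize call sites would inline the key and default.
    def getMemory(key: String, defaultValue: String, outputScale: Char = 'm'): Long =
      Utils.parseMemoryString(get(key, defaultValue), outputScaleChar = outputScale)
    
    // e.g. getMemory("spark.driver.maxResultSize", "1g", outputScale = 'b')
    ```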




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-76058874
  
    @hammer @ryan-williams I'd prefer to hear from @JoshRosen or @pwendell before merging. The two key questions are
    
    - Gut feeling about the potential for merge conflict -- OK?
    - An opinion on https://github.com/apache/spark/pull/3525/files#r24632812 and maybe the `for` comprehension question, though that's really not that important either way




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24373338
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala ---
    @@ -20,6 +20,7 @@ package org.apache.spark.scheduler.local
     import java.nio.ByteBuffer
     
     import scala.concurrent.duration._
    +import scala.language.postfixOps
    --- End diff --
    
    Saw this while compiling:
    
    > [WARNING] /Users/ryan/c/spark/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogManager.scala:156: postfix operator second should be enabled
    by making the implicit value scala.language.postfixOps visible.
    This can be achieved by adding the import clause 'import scala.language.postfixOps'
    or by setting the compiler option -language:postfixOps.
    See the Scala docs for value scala.language.postfixOps for a discussion
    why the feature should be explicitly enabled.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24400754
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -203,6 +203,28 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
         getOption(key).map(_.toBoolean).getOrElse(defaultValue)
       }
     
    +  // Limit of bytes for total size of results (default is 1GB)
    --- End diff --
    
    Hm, why is this property special-cased here? The methods in this class are generally generic.




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73783554
  
    (thanks for the review!)




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3525#discussion_r24401004
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -980,33 +980,77 @@ private[spark] object Utils extends Logging {
         )
       }
     
    +
    +  private val TB = 1L << 40
    +  private val GB = 1L << 30
    +  private val MB = 1L << 20
    +  private val KB = 1L << 10
    +
    +  private val scaleCharToFactor: Map[Char, Long] = Map(
    +    'b' -> 1L,
    +    'k' -> KB,
    +    'm' -> MB,
    +    'g' -> GB,
    +    't' -> TB
    +  )
    +
       /**
    -   * Convert a Java memory parameter passed to -Xmx (such as 300m or 1g) to a number of megabytes.
    -   */
    -  def memoryStringToMb(str: String): Int = {
    +   * Convert a Java memory parameter passed to -Xmx (such as "300m" or "1g") to a number of
    +   * megabytes (or other byte-scale denominations as specified by @outputScaleChar).
    +   *
    +   * For @defaultInputScaleChar and @outputScaleChar, valid values are: 'b' (bytes), 'k'
    +   * (kilobytes), 'm' (megabytes), 'g' (gigabytes), and 't' (terabytes).
    +   *
    +   * @param str String to parse an amount of memory out of
    +   * @param defaultInputScaleChar if no "scale" is provided on the end of @str (i.e. @str is a
    +   *                              plain numeric value), assume this scale (default: 'b' for
    +   *                              'bytes')
    +   * @param outputScaleChar express the output in this scale, i.e. number of bytes, kilobytes,
    +   *                        megabytes, or gigabytes.
    +   */
    +  def parseMemoryString(
    +      str: String,
    +      defaultInputScaleChar: Char = 'b',
    +      outputScaleChar: Char = 'm'): Long = {
    +
         val lower = str.toLowerCase
    -    if (lower.endsWith("k")) {
    -      (lower.substring(0, lower.length-1).toLong / 1024).toInt
    -    } else if (lower.endsWith("m")) {
    -      lower.substring(0, lower.length-1).toInt
    -    } else if (lower.endsWith("g")) {
    -      lower.substring(0, lower.length-1).toInt * 1024
    -    } else if (lower.endsWith("t")) {
    -      lower.substring(0, lower.length-1).toInt * 1024 * 1024
    -    } else {// no suffix, so it's just a number in bytes
    -      (lower.toLong / 1024 / 1024).toInt
    -    }
    +    val lastChar = lower(lower.length - 1)
    +    val (num, inputScaleChar) =
    +      if (lastChar.isDigit) {
    +        (lower.toLong, defaultInputScaleChar)
    +      } else {
    +        (lower.substring(0, lower.length - 1).toLong, lastChar)
    +      }
    +
    +    (for {
    --- End diff --
    
    Why is a for construction used here? just to handle the invalid in / out scale param? My personal taste would be to just check that the Option[Long] exists directly and do away with it. I can't tell how much that's just me versus how the kids talk in Scala these days. A looping construct surprised me as there is no loop.




[GitHub] spark pull request: [SPARK-4665] [SPARK-4666] Improve YarnAllocato...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-73594255
  
    (until I hear otherwise I am pulling/rebasing 1∪2, roughly the SPARK-4666 bits, into their own PR)




[GitHub] spark pull request: [SPARK-4666] Improve YarnAllocator's parsing o...

Posted by hammer <gi...@git.apache.org>.
Github user hammer commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-76057244
  
    @srowen this one ready to go in?




[GitHub] spark pull request: [SPARK-4665] Improve YarnAllocator's parsing o...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/3525#issuecomment-65023413
  
    Filed [SPARK-4665](https://issues.apache.org/jira/browse/SPARK-4665) and [SPARK-4666](https://issues.apache.org/jira/browse/SPARK-4666), noted the former in the PR title, and put a note about both in the opening PR comment. lmk if there's some other syntax for addressing two JIRAs that is warranted.

