Posted to reviews@spark.apache.org by gerashegalov <gi...@git.apache.org> on 2018/02/11 07:27:20 UTC

[GitHub] spark pull request #20575: [SPARK-23386][DEPLOY] enable direct application l...

GitHub user gerashegalov opened a pull request:

    https://github.com/apache/spark/pull/20575

    [SPARK-23386][DEPLOY] enable direct application links in SHS before replay

    ## What changes were proposed in this pull request?
    Make direct job links available from the scan thread, before the full replay completes. Otherwise, direct job links might not be available for hours.
    
    ## How was this patch tested?
    Tested with a deployment on multiple 10k apps. This is currently a prototype for YARN, but it should be generalizable.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gerashegalov/spark gera/logs-events-from-listing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20575.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20575
    
----
commit e27880263f36a7b8beee62c902389c293bb2a17e
Author: Gera Shegalov <ge...@...>
Date:   2018-02-09T15:05:12Z

    List-driven bootstrap replay
    
    (cherry picked from commit 0d4e2a2215bb9e102ce449c52bcf7c3d44fc6d44)

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20575: [SPARK-23386][DEPLOY] enable direct application links in...

Posted by gerashegalov <gi...@git.apache.org>.
Github user gerashegalov commented on the issue:

    https://github.com/apache/spark/pull/20575
  
    @vanzin what do you mean by "as part of parsing the logs"? This PR is about avoiding the long wait for event logs to be read from a remote filesystem and parsed.


---



[GitHub] spark issue #20575: [SPARK-23386][DEPLOY] enable direct application links in...

Posted by gerashegalov <gi...@git.apache.org>.
Github user gerashegalov commented on the issue:

    https://github.com/apache/spark/pull/20575
  
    Thanks for the suggestions; I will look into them. I agree that a solution in the right direction will definitely involve changing the write call path. I did not go down this path because I have no control over my customer's Spark versions (at least for now).
    
    There is another long-standing issue with FsHistoryProvider: it uses `rename`, which is inefficient on S3.


---



[GitHub] spark issue #20575: [SPARK-23386][DEPLOY] enable direct application links in...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20575
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #20575: [SPARK-23386][DEPLOY] enable direct application links in...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/20575
  
    True. Still, to be able to do that, you're hardcoding YARN-isms into the code, e.g., how application IDs look, so that you can create a "fake" application entry that will, hopefully, eventually match the actual contents of the log file.
    
    What you're trying here is a stop-gap fix for SPARK-6951. I was hoping we could have an actual solution to that problem. I thought about skipping data (instead of the current code that still reads the data, just doesn't process events it doesn't care about), but couldn't figure out how to make that work with compression on.
    
    There have been suggestions thrown around, like having Spark write a summary file side-by-side with the event log, for the SHS to consume. But that doesn't help existing event logs.
    
    If you'd like to go down this path I'd suggest forgetting about the whole app id parsing thing, and creating actual, fake entries for these logs that clearly indicate they're fake and temporary, and cleaning them up once the log file is parsed. You could do that by creating the fake entry (if the app's entry doesn't exist yet) and providing it to the parsing task, so that once it's done it cleans up the temp entry before writing the real one.
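The temporary-entry idea in the comment above can be sketched roughly as follows. This is a minimal illustration in Python rather than Spark's Scala, and every name in it (`AppEntry`, `ListingStore`, `register_placeholder`, `finish_parse`, the `pending:` key prefix) is hypothetical and not part of Spark. It assumes the placeholder is keyed by the log path instead of a parsed application ID, and that the parsing task receives the placeholder so it can remove it before writing the real entry.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AppEntry:
    app_id: str        # key used in the listing
    log_path: str      # event log this entry was derived from
    name: str
    placeholder: bool  # True while the log has not been parsed yet


class ListingStore:
    """Hypothetical in-memory stand-in for the history server's listing store."""

    def __init__(self) -> None:
        self._entries: Dict[str, AppEntry] = {}

    def get(self, app_id: str) -> Optional[AppEntry]:
        return self._entries.get(app_id)

    def find_by_log(self, log_path: str) -> Optional[AppEntry]:
        for entry in self._entries.values():
            if entry.log_path == log_path:
                return entry
        return None

    def put(self, entry: AppEntry) -> None:
        self._entries[entry.app_id] = entry

    def delete(self, app_id: str) -> None:
        self._entries.pop(app_id, None)


def register_placeholder(store: ListingStore, log_path: str) -> Optional[AppEntry]:
    """Called from the scan: add a clearly temporary entry if the log has none yet."""
    if store.find_by_log(log_path) is not None:
        return None
    # The key is derived from the log path, not from a guessed application ID.
    temp = AppEntry(
        app_id=f"pending:{log_path}",
        log_path=log_path,
        name="(event log not parsed yet)",
        placeholder=True,
    )
    store.put(temp)
    return temp


def finish_parse(
    store: ListingStore,
    placeholder: Optional[AppEntry],
    parsed_app_id: str,
    parsed_name: str,
    log_path: str,
) -> AppEntry:
    """Called by the parsing task: drop the temp entry, then write the real one."""
    if placeholder is not None:
        store.delete(placeholder.app_id)
    real = AppEntry(parsed_app_id, log_path, parsed_name, placeholder=False)
    store.put(real)
    return real
```

Because the placeholder never pretends to carry a real application ID, nothing in this sketch depends on how a particular cluster manager formats its IDs.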


---



[GitHub] spark issue #20575: [SPARK-23386][DEPLOY] enable direct application links in...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/20575
  
    If doing this, it would be cleaner to do it as part of parsing the logs. For example, if you make `AppListingListener` write the app info to the store when interesting events happen, that would be much better and less race-prone.
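One way to read the suggestion above is a listener that writes to the store as soon as it sees an application-start event, instead of waiting for the end of the replay. The sketch below is in Python rather than Spark's Scala and is not Spark's `AppListingListener`; the class `EagerAppListingListener`, the `replay` helper, and the dict-based store are hypothetical. The JSON field names (`Event`, `App ID`, `App Name`, `Timestamp`) are assumed to match the event log format.

```python
import json
from typing import Dict, Iterable, Optional


class EagerAppListingListener:
    """Hypothetical listener that updates the listing while the log is replayed."""

    def __init__(self, store: Dict[str, dict]) -> None:
        self.store = store
        self.app_id: Optional[str] = None

    def on_event(self, event: dict) -> None:
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            self.app_id = event.get("App ID")
            if self.app_id is None:
                return
            # Write immediately, so the app is listed before replay finishes.
            self.store[self.app_id] = {
                "name": event.get("App Name"),
                "start_time": event.get("Timestamp"),
                "completed": False,
            }
        elif kind == "SparkListenerApplicationEnd" and self.app_id in self.store:
            self.store[self.app_id]["end_time"] = event.get("Timestamp")
            self.store[self.app_id]["completed"] = True
        # All other events are ignored by this listing-only listener.


def replay(lines: Iterable[str], listener: EagerAppListingListener) -> None:
    """Feed one JSON event per line to the listener, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            listener.on_event(json.loads(line))
```

In this sketch the entry is created by the same code that parses the log, so there is no separate scan-time write that has to be reconciled with the parsed result later.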


---
