You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by szhem <gi...@git.apache.org> on 2018/09/01 11:32:49 UTC
[GitHub] spark issue #19410: [SPARK-22184][CORE][GRAPHX] GraphX fails in case of insu...

Github user szhem commented on the issue:

    https://github.com/apache/spark/pull/19410
  
    Hi @asolimando,
    
    I believe that the solution with weak references will work and probably with `ContextCleaner` too, but there are some points I'd like to discuss if you don't mind
    
    - Let's take `ContextCleaner` first. In that case we should have a pretty simple solution, but the backward compatibility of `PeriodicRDDCheckpointer` and `PeriodicGraphCheckpointer` will be lost, because 
      - it will be necessary to update these classes to prevent deleting checkpoint files 
      - user will always have to provide `spark.cleaner.referenceTracking.cleanCheckpoints` property in order to clean unnecessary checkpoints. 
      - the users who haven't specified `spark.cleaner.referenceTracking.cleanCheckpoints` previously (and I believe there will be most of them) will be affected by this new unexpected behaviour 
    
    - In case of custom solution based on weak references 
      - it will be necessary to poll a reference queue at some place and moment to remove unnecessary checkpoint files. 
      - polling the reference queue in the separate thread will complicate the code 
      - polling the reference queue synchronously does not guarantee deleting all the unnecessary checkpoint files.
    
    In case of reference queue, could you please recommend the convenient place in the source code to do it? 
    
    As for me 
    - setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by default should work 
    - setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by default will allow us to simplify `PeriodicRDDCheckpointer` and `PeriodicGraphCheckpointer` too by deleting unnecessary code that cleans up checkpoint files
    - setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by default sounds reasonable and should not break the code of those users who didn't do it previously
    - but setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` will probably break the backward compatibility (although this PR tries to preserve it)
    
    What do you think? Will the community accept setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true`by default?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org