You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/03 10:26:15 UTC

[GitHub] [spark] attilapiros opened a new pull request #35720: [SPARK-33206][CORE] Fix shuffle index cache weight calculation for small index files

attilapiros opened a new pull request #35720:
URL: https://github.com/apache/spark/pull/35720


   ### What changes were proposed in this pull request?
   
   Increasing the shuffle index weight with a constant number to avoid underestimating retained memory size caused by the bookkeeping objects: the `java.io.File` (depending on the path ~ 960 bytes) object and the `ShuffleIndexInformation` object (~180 bytes).
   
   ### Why are the changes needed?
   
   Underestimating cache entry size easily can cause OOM in the Yarn NodeManager. 
   In the following analyses of a prod issue (HPROF file) we can see the leak suspect Guava's `LocalCache$Segment` objects:
   
   <img width="943" alt="Screenshot 2022-02-17 at 18 55 40" src="https://user-images.githubusercontent.com/2017933/154541995-44014212-2046-41d6-ba7f-99369ca7d739.png">
   
   Going further we can see a `ShuffleIndexInformation` for a small index file (16 bytes) but the retained heap memory is 1192 bytes:
   
   <img width="1351" alt="image" src="https://user-images.githubusercontent.com/2017933/154645212-e0318d0f-cefa-4ae3-8a3b-97d2b506757d.png">
   
   Finally we can see this is very common within this heap dump (using MAT's Object Query Language):
    
   <img width="1418" alt="image" src="https://user-images.githubusercontent.com/2017933/154547678-44c8af34-1765-4e14-b71a-dc03d1a304aa.png">
   
   I have even exported the data to a CSV and done some calculations with `awk`:
   
   ```
   $ tail -n+2 export.csv | awk -F, 'BEGIN { numUnderEstimated=0; } { sumOldSize += $1; corrected=$1 + 1176; sumCorrectedSize += corrected; sumRetainedMem += $2; if (corrected < $2) numUnderEstimated+=1; } END { print "sum old size: " sumOldSize / 1024 / 1024   " MB, sum corrected size: " sumCorrectedSize / 1024 / 1024 " MB, sum retained memory:" sumRetainedMem / 1024 / 1024  " MB, num under estimated: " numUnderEstimated }'
   ```
   
   It gives the followings:
   ```
   sum old size: 76.8785 MB, sum corrected size: 1066.93 MB, sum retained memory:1064.47 MB, num under estimated: 0
   ```
   
   So using the old calculation we were at 7.6.8 MB way under the default cache limit (100 MB). 
   Using the correction (applying 1176 as increment to the size) we are at 1066.93 MB (~1GB) which is close to the real retained sum heap: 1064.47 MB (~1GB) and there is no entry which was underestimated.
   
   But we can go further and get rid of `java.io.File` completely and store the `ShuffleIndexInformation` for the file path.
   This way not only the cache size estimate is improved but the its size is decreased as well. 
   Here the path size is not counted into the cache size as that string is interned. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   With the calculations above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #35720: [SPARK-33206][CORE][3.1] Fix shuffle index cache weight calculation for small index files

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #35720:
URL: https://github.com/apache/spark/pull/35720#issuecomment-1058039329


   Thanks @HyukjinKwon for correcting the title!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #35720: [SPARK-33206][CORE][3.1] Fix shuffle index cache weight calculation for small index files

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #35720:
URL: https://github.com/apache/spark/pull/35720


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #35720: [SPARK-33206][CORE][3.1] Fix shuffle index cache weight calculation for small index files

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #35720:
URL: https://github.com/apache/spark/pull/35720#issuecomment-1058044744


   cc @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35720: [SPARK-33206][CORE][3.1] Fix shuffle index cache weight calculation for small index files

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35720:
URL: https://github.com/apache/spark/pull/35720#issuecomment-1058297847


   Thank you, @attilapiros and @HyukjinKwon .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org