Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/26 20:58:51 UTC

[GitHub] [spark] srowen opened a new pull request #24226: [SPARK-26660][FOLLOWUP] Raise task serialized size warning threshold to 1000 KiB

URL: https://github.com/apache/spark/pull/24226
 
 
   ## What changes were proposed in this pull request?
   
   Raise the serialized task size threshold at which a warning is logged from 100 KiB to 1000 KiB.
   
   As several people have noted, the original change for this JIRA showed that this threshold is too low. Test output regularly contains warnings like:
   
   ```
   - sorting on StringType with nullable=false, sortOrder=List('a DESC NULLS LAST)
   22:47:53.320 WARN org.apache.spark.scheduler.TaskSetManager: Stage 80 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
   22:47:53.348 WARN org.apache.spark.scheduler.TaskSetManager: Stage 81 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
   22:47:53.417 WARN org.apache.spark.scheduler.TaskSetManager: Stage 83 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
   22:47:53.444 WARN org.apache.spark.scheduler.TaskSetManager: Stage 84 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
   
   ...
   
   - SPARK-20688: correctly check analysis for scalar sub-queries
   22:49:10.314 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.8 KiB
   - SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1
   22:49:10.595 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
   22:49:10.744 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
   22:49:10.894 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
   - SPARK-21835: Join in correlated subquery should be duplicateResolved: case 2
   - SPARK-21835: Join in correlated subquery should be duplicateResolved: case 3
   - SPARK-23316: AnalysisException after max iteration reached for IN query
   22:49:11.559 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 154.2 KiB
   ```
   
   A larger threshold of about 1 MiB (1000 KiB) seems more suitable.
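   To illustrate the effect of the change, here is a minimal sketch (not Spark's actual `TaskSetManager` code; the object and method names are hypothetical) of the kind of check behind these warnings: a task whose serialized size exceeds the threshold produces a warning message.
   
   ```scala
   // Hypothetical sketch of the task-size warning check; the real logic
   // lives in org.apache.spark.scheduler.TaskSetManager.
   object TaskSizeCheck {
     // Threshold this PR raises from 100 KiB to 1000 KiB.
     val WarnThresholdKiB: Long = 1000
   
     // Returns Some(warning) if a task's serialized size exceeds the
     // threshold, None otherwise.
     def sizeWarning(stageId: Int, serializedBytes: Long): Option[String] = {
       val kib = serializedBytes / 1024
       if (kib > WarnThresholdKiB)
         Some(s"Stage $stageId contains a task of very large size ($kib KiB). " +
              s"The maximum recommended task size is $WarnThresholdKiB KiB.")
       else
         None
     }
   }
   ```
   
   Under the new threshold, the 755 KiB tasks from the test output above would no longer trigger a warning, while genuinely oversized tasks still would.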
   
   ## How was this patch tested?
   
   Existing tests.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org