Posted to dev@spark.apache.org by Wes Holler <wh...@algebraixdata.com> on 2016/04/30 01:09:17 UTC

RPC Timeout and Abnormally Long JvmGcTime

Recently we switched to EMR 4.5/Spark 1.6.1 and have since encountered a new failure scenario.

The primary symptom is that the cluster appears to be stalled: the job has not failed, but it will not proceed and has to be killed. One or more RpcTimeoutExceptions (see below) are usually found toward the bottom of the log. Another observation, from the status API, is that the jvmGcTime value (see the JSON snippet below) for some tasks in the stage that never finished is abnormally large (in the millions). The problem does not always occur, even for identical queries and data, and when the cluster does not hang I also do not see these large jvmGcTime values.
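
For reference, a minimal sketch of one way to pull these values out of the status API (illustrative only: the host, application id, and stage id are placeholders, and the regex is a crude stand-in for real JSON parsing):

  import scala.io.Source

  object GcScan {
    def main(args: Array[String]): Unit = {
      // Placeholder URL for the v1 status API stage endpoint on the driver UI.
      val url = "http://<driver-host>:4040/api/v1/applications/<app-id>/stages/<stage-id>"
      val json = Source.fromURL(url).mkString
      // Crude but dependency-free: extract every jvmGcTime value in the response.
      val gcTimes = """"jvmGcTime"\s*:\s*(\d+)""".r
        .findAllMatchIn(json).map(_.group(1).toLong).toSeq
      // Flag tasks whose reported GC time exceeds the default 120s askTimeout.
      val suspicious = gcTimes.count(_ > 120000L)
      println(s"$suspicious of ${gcTimes.size} tasks report jvmGcTime > 120s")
    }
  }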

Some questions:
1. What units is jvmGcTime in?
2. If it is in milliseconds, that seems to indicate garbage collection times into the tens of minutes - far longer than the queries themselves usually take, and also longer than the default spark.rpc.askTimeout value of 120s. Could this be blocking/starving the promise and causing the failure?
3. We are doing almost entirely standard SQL: no UDFs and no use of the RDD API. I was under the impression that in Spark 1.6, pretty much everything in DataFrames uses the custom Tungsten serialization and unsafe buffers, even during the actual processing (so no actual unroll). Is this correct? What could be causing such GC churn? (A diagnostic configuration sketch follows this list.)
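
In case it helps with triage, here is a minimal configuration sketch - a diagnostic, not a fix - that widens the RPC timeout and turns on executor GC logging so long pauses show up in the executor stdout. The 600s value is illustrative only:

  import org.apache.spark.SparkConf

  // Widen the default 120s ask timeout and log GC activity on executors.
  val conf = new SparkConf()
    .set("spark.rpc.askTimeout", "600s")      // default: 120s
    .set("spark.network.timeout", "600s")
    .set("spark.executor.extraJavaOptions",
      "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")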

===============================================================================

The exceptions found in the log (several stack trace lines removed for brevity):

16/04/29 17:43:29 ERROR ContextCleaner: Error cleaning broadcast 0
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	.
	at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	.
	... 12 more
16/04/29 17:43:29 WARN BlockManagerMaster: Failed to remove broadcast 0 with removeFromMaster = true - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	.
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
	at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
	... 7 more
16/04/29 17:43:29 INFO ContextCleaner: Cleaned accumulator 2
16/04/29 19:33:23 INFO PackagesResourceConfig: Scanning for root resource and provider classes in the packages:
  org.apache.spark.status.api.v1
16/04/29 19:33:28 INFO ScanningResourceConfig: Root resource classes found:
  class org.apache.spark.status.api.v1.ApiRootResource
16/04/29 19:33:28 INFO ScanningResourceConfig: Provider classes found:
  class org.apache.spark.status.api.v1.JacksonMessageWriter
16/04/29 19:33:28 INFO WebApplicationImpl: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
16/04/29 19:34:09 WARN Errors: The following warnings have been detected with resource and/or provider classes:
  WARNING: A sub-resource method, public scala.collection.Seq org.apache.spark.status.api.v1.OneStageResource.stageData(int), with URI template, "", is treated as a resource method

===============================================================================

A sample of the task JSON scraped from the status API endpoint:

"1446406" : {
      "taskId" : 1446406,
      "index" : 593,
      "attempt" : 0,
      "launchTime" : "2016-04-27T11:16:25.005GMT",
      "executorId" : "364",
      "host" : "ip-10-0-0-41.ec2.internal",
      "taskLocality" : "RACK_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorRunTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 1637765,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 40824341,
          "recordsRead" : 1130000
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      }
    },
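
For scale: if jvmGcTime is indeed in milliseconds (as question 2 assumes), then 1637765 ms / 60000 ≈ 27.3 minutes of GC attributed to this single task - roughly 13x the default 120s askTimeout.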





RE: RPC Timeout and Abnormally Long JvmGcTime

Posted by Wes Holler <wh...@algebraixdata.com>.
Oops... the garbled "adasdasdasdasd" line in my original message was placeholder text; the sample is JSON scraped from the status API endpoint.



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org