Posted to user@ignite.apache.org by percent620 <pe...@163.com> on 2016/09/02 14:45:13 UTC

Spark stage was hang via ignite

/u01/spark-1.6.0-hive/bin/spark-submit --driver-memory 4G \
  --class com.ETLTransform --master yarn \
  --executor-cores 4 --executor-memory 1000m --num-executors 20 \
  --conf spark.rdd.compress=false \
  --conf spark.shuffle.compress=false \
  --conf spark.broadcast.compress=false \
  /u01/spark_engine863.jar -quesize 10 -batchSize 5000 -writethread 30 -runningSeconds 20
args name=-quesize value=10
args name=-batchSize value=5000
args name=-writethread value=30
args name=-runningSeconds value=20
16/09/02 22:21:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/09/02 22:21:17 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
16/09/02 22:21:17 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
16/09/02 22:21:17 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
ignite:start======
[Stage 0:==========================================>              (15 + 5) / 20]

The stage hangs at this point and makes no further progress.
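The four SparkUI warnings at the start of the console output are most likely unrelated to the hang: the Spark UI listens on port 4040 by default and, when that port is already in use (presumably by other Spark drivers running on the same host), tries the next port, up to `spark.port.maxRetries` times (16 by default). A minimal sketch of that retry pattern with plain `java.net.ServerSocket`, for illustration only; `PortRetry`, `tryBind`, and `bindWithRetry` are hypothetical names and this is not Spark's actual code:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortRetry {

    /** Tries to bind a server socket to the given port; returns null if the bind fails. */
    public static ServerSocket tryBind(int port) {
        ServerSocket socket = null;
        try {
            socket = new ServerSocket();
            socket.bind(new InetSocketAddress(port));
            return socket;
        } catch (IOException e) {
            if (socket != null) {
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if closing an unbound socket fails.
                }
            }
            return null;
        }
    }

    /**
     * Tries basePort, basePort + 1, ..., basePort + maxRetries in order and returns
     * the first socket that binds, or null if every candidate port is taken.
     */
    public static ServerSocket bindWithRetry(String service, int basePort, int maxRetries) {
        for (int offset = 0; offset <= maxRetries; offset++) {
            int port = basePort + offset;
            if (port > 65535) {
                break; // Ran past the valid TCP port range.
            }
            ServerSocket socket = tryBind(port);
            if (socket != null) {
                return socket;
            }
            if (offset < maxRetries) {
                // Same shape as the SparkUI warnings in the console output.
                System.err.println("Service '" + service + "' could not bind on port " + port
                        + ". Attempting port " + (port + 1) + ".");
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // 4040 is Spark's default UI port; 16 is the default of spark.port.maxRetries.
        ServerSocket ui = bindWithRetry("SparkUI", 4040, 16);
        if (ui == null) {
            System.err.println("No free port found");
        } else {
            System.out.println("Bound on port " + ui.getLocalPort());
            ui.close();
        }
    }
}
```

If the UI ended up on 4044, four earlier ports were busy, which by itself should not block task execution.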


I looked at a running executor in the Spark UI and found the following error messages:

16/09/02 22:21:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hbase); users with modify permissions: Set(hbase)
16/09/02 22:21:44 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/09/02 22:21:44 INFO Remoting: Starting remoting
16/09/02 22:21:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@vmsecdomain010194070063.cm10:56914]
16/09/02 22:21:44 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 56914.
16/09/02 22:21:44 INFO storage.DiskBlockManager: Created local directory at /u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/blockmgr-218e5cde-129e-41f8-b05e-c262e24c346f
16/09/02 22:21:44 INFO storage.MemoryStore: MemoryStore started with capacity 6.8 GB
16/09/02 22:21:44 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.194.70.26:51811
16/09/02 22:21:44 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/09/02 22:21:44 INFO executor.Executor: Starting executor ID 2 on host xxxxxxx
16/09/02 22:21:45 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41006.
16/09/02 22:21:45 INFO netty.NettyBlockTransferService: Server created on 41006
16/09/02 22:21:45 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/09/02 22:21:45 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/02 22:21:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
16/09/02 22:21:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9
16/09/02 22:21:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 17
16/09/02 22:21:50 INFO executor.Executor: Running task 9.0 in stage 0.0 (TID 9)
16/09/02 22:21:50 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
16/09/02 22:21:50 INFO executor.Executor: Running task 17.0 in stage 0.0 (TID 17)
16/09/02 22:21:50 INFO executor.Executor: Fetching http://10.194.70.26:48676/jars/spark_zmqpull_engine863.jar with timestamp 1472826078761
16/09/02 22:21:50 INFO util.Utils: Fetching http://10.194.70.26:48676/jars/spark_zmqpull_engine863.jar to /u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/spark-50073b87-72e3-4f4a-84cd-f01a7d5061dd/fetchFileTemp8818016450869617667.tmp
16/09/02 22:21:58 INFO util.Utils: Copying /u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/spark-50073b87-72e3-4f4a-84cd-f01a7d5061dd/-10798427751472826078761_cache to /u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/container_1455892346017_5645_01_000009/./spark_zmqpull_engine863.jar
16/09/02 22:21:59 INFO executor.Executor: Adding file:/u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/container_1455892346017_5645_01_000009/./spark_zmqpull_engine863.jar to class loader
16/09/02 22:21:59 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/09/02 22:21:59 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1568.0 B, free 1568.0 B)
16/09/02 22:21:59 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 129 ms
16/09/02 22:21:59 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1608.0 B, free 3.1 KB)
16/09/02 22:21:59 INFO internal.IgniteKernal: 

>>>    __________  ________________  
>>>   /  _/ ___/ |/ /  _/_  __/ __/  
>>>  _/ // (7 7    // /  / / / _/    
>>> /___/\___/_/|_/___/ /_/ /___/   
>>> 
>>> ver. 1.6.0#20160518-sha1:0b22c45b
>>> 2016 Copyright(C) Apache Software Foundation
>>> 
>>> Ignite documentation: http://ignite.apache.org

16/09/02 22:21:59 INFO internal.IgniteKernal: Config URL: n/a
16/09/02 22:21:59 INFO internal.IgniteKernal: Daemon mode: off
16/09/02 22:21:59 INFO internal.IgniteKernal: OS: Linux 2.6.32-220.23.2.ali878.el6.x86_64 amd64
16/09/02 22:21:59 INFO internal.IgniteKernal: OS user: hbase
16/09/02 22:21:59 INFO internal.IgniteKernal: Language runtime: Java Platform API Specification ver. 1.7
16/09/02 22:21:59 INFO internal.IgniteKernal: VM information: Java(TM) SE Runtime Environment 1.7.0_79-b15 Oracle Corporation OpenJDK (Alibaba) 64-Bit Server VM 24.79-b02-internal
16/09/02 22:21:59 INFO internal.IgniteKernal: VM total memory: 9.4GB
16/09/02 22:21:59 INFO internal.IgniteKernal: Remote Management [restart: off, REST: on, JMX (remote: off)]
16/09/02 22:21:59 INFO internal.IgniteKernal: IGNITE_HOME=null
16/09/02 22:21:59 INFO internal.IgniteKernal: VM arguments: [-XX:OnOutOfMemoryError=kill %p, -Xms10000m, -Xmx10000m, -XX:MaxPermSize=256M, -Djava.io.tmpdir=/u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/container_1455892346017_5645_01_000009/tmp, -Dspark.driver.port=51811, -Dspark.yarn.app.container.log.dir=/u01/hbase/hadoop-2.5.0-cdh5.3.0/logs/userlogs/application_1455892346017_5645/container_1455892346017_5645_01_000009, -XX:MaxPermSize=256m]
16/09/02 22:21:59 INFO internal.IgniteKernal: Configured caches ['ignite-marshaller-sys-cache', 'ignite-sys-cache', 'ignite-atomics-sys-cache']
16/09/02 22:22:00 INFO internal.IgniteKernal: Non-loopback local IPs: 10.194.70.63
16/09/02 22:22:00 INFO internal.IgniteKernal: Enabled local MACs: 283152A77F79
16/09/02 22:22:00 INFO plugin.IgnitePluginProcessor: Configured plugins:
16/09/02 22:22:00 INFO plugin.IgnitePluginProcessor:   ^-- None
16/09/02 22:22:00 INFO plugin.IgnitePluginProcessor: 
16/09/02 22:22:00 INFO tcp.TcpCommunicationSpi: IPC shared memory server endpoint started [port=48101, tokDir=/u01/hbase/tmp/nm-local-dir/usercache/hbase/appcache/application_1455892346017_5645/container_1455892346017_5645_01_000009/tmp/ignite/work/ipc/shmem/cad18430-c808-4c97-a6b1-ecff919b0108-82210]
16/09/02 22:22:00 INFO tcp.TcpCommunicationSpi: Successfully bound shared memory communication to TCP port [port=48101, locHost=0.0.0.0/0.0.0.0]
16/09/02 22:22:00 INFO tcp.TcpCommunicationSpi: Successfully bound to TCP port [port=47101, locHost=0.0.0.0/0.0.0.0]
16/09/02 22:22:00 WARN noop.NoopCheckpointSpi: Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
16/09/02 22:22:00 WARN collision.GridCollisionManager: Collision resolution is disabled (all jobs will be activated upon arrival).
16/09/02 22:22:00 WARN noop.NoopSwapSpaceSpi: Swap space is disabled. To enable use FileSwapSpaceSpi.
16/09/02 22:22:00 INFO internal.IgniteKernal: Security status [authentication=off, tls/ssl=off]
16/09/02 22:22:00 INFO tcp.GridTcpRestProtocol: Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11212]
16/09/02 22:22:00 INFO tcp.TcpDiscoverySpi: Successfully bound to TCP port [port=47501, localHost=0.0.0.0/0.0.0.0]
16/09/02 22:22:00 WARN multicast.TcpDiscoveryMulticastIpFinder: TcpDiscoveryMulticastIpFinder has no pre-configured addresses (it is recommended in production to specify at least one address in TcpDiscoveryMulticastIpFinder.getAddresses() configuration property)
16/09/02 22:22:05 INFO cache.GridCacheProcessor: Started cache [name=ignite-marshaller-sys-cache, mode=REPLICATED]
16/09/02 22:22:05 INFO cache.GridCacheProcessor: Started cache [name=embedCache, mode=PARTITIONED]
16/09/02 22:22:05 INFO cache.GridCacheProcessor: Started cache [name=ignite-atomics-sys-cache, mode=PARTITIONED]
16/09/02 22:22:05 INFO cache.GridCacheProcessor: Started cache [name=ignite-sys-cache, mode=REPLICATED]
16/09/02 22:22:09 INFO discovery.GridDiscoveryManager: Added new node to topology: TcpDiscoveryNode [id=68644149-534e-4a21-932d-43c02062d085, addrs=[10.194.70.71, 127.0.0.1], sockAddrs=[xx:47500, /xx:47500, /127.0.0.1:47500], discPort=47500, order=1187, intOrder=640, lastExchangeTime=1472826127728, loc=false, ver=1.6.0#20160518-sha1:0b22c45b, isClient=false]
16/09/02 22:22:09 INFO discovery.GridDiscoveryManager: Topology snapshot [ver=1187, servers=92, clients=1, CPUs=1824, heap=600.0GB]
16/09/02 22:22:35 WARN cache.GridCachePartitionExchangeManager: Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.
16/09/02 22:23:05 WARN cache.GridCachePartitionExchangeManager: Still
waiting for initial partition map exchange
[fut=GridDhtPartitionsExchangeFuture [dummy=false, forcePreload=false,
reassign=false, discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=cad18430-c808-4c97-a6b1-ecff919b0108, addrs=[10.194.70.63, 127.0.0.1],
sockAddrs=[xx.cm10/yy:47501, /yy:47501, /127.0.0.1:47501], discPort=47501,
order=1186, intOrder=639, lastExchangeTime=1472826184572, loc=true,
ver=1.6.0#20160518-sha1:0b22c45b, isClient=false], topVer=1186,
nodeId8=cad18430, msg=null, type=NODE_JOINED, tstamp=1472826124647],
crd=TcpDiscoveryNode [id=e13a3446-b45a-4f31-9740-f8a75ac3fd78,
addrs=[10.194.70.60, 127.0.0.1], sockAddrs=[xx/bb:47500, /bb:47500,
/127.0.0.1:47500], discPort=47500, order=876, intOrder=459,
lastExchangeTime=1472826122976, loc=false, ver=1.6.0#20160518-sha1:0b22c45b,
isClient=false], exchId=GridDhtPartitionExchangeId
[topVer=AffinityTopologyVersion [topVer=1186, minorTopVer=0],
nodeId=cad18430, evt=NODE_JOINED], added=false, initFut=GridFutureAdapter
[resFlag=0, res=null, startTime=1472826125231, endTime=0,
ignoreInterrupts=false, state=INIT], init=false, topSnapshot=null,
lastVer=null, partReleaseFut=null, affChangeMsg=null, skipPreload=false,
clientOnlyExchange=false, initTs=1472826125231, centralizedAff=false,
evtLatch=0, remaining=[c8c7f8c4-3db4-4215-ba1c-dc45502d6327,
4aaca392-4985-4f4d-8ea3-66281584be36, 0d45635b-fc3e-4650-8f46-6336caae9b89,
1e7ebe14-ba8a-4711-93d3-11cf2c931d50, 5d0949e6-730d-4eef-90af-1fe1ada4fef2,
5698e9ab-ae20-4155-bcf7-f60945c7b688, 30a42087-2cde-4937-a3f1-4924e4a947a3,
588cf5af-5f39-405c-bdf5-adc9030375d2, 44c03c97-fb32-4fc0-beb5-4551c2857183,
0847fce3-56ef-48eb-91ae-c07ac45c8340, 9e11121e-e928-4986-971a-f5f82fc44ab8,
3b2c76ff-d384-443f-9575-59048d8631cf, e3298869-f1e3-4d88-93c5-c22b0e3f05c7,
c97a064a-a4d2-4090-80cb-bf34f7aaf6a3, 44b9a00d-5f01-46a2-9709-7be87128d32b,
465bd409-8ea9-4d23-9ad6-8dfceef6f852, 1d4250aa-d289-4ebe-903d-08aafa5e10b1,
093185dd-9934-4fe5-b60e-13871c35259a, c19e588c-6a94-4276-8238-9b45aa6cfdee,
cec1a06a-c944-487a-8ac6-90492e3481cc, 43f4b602-f0d0-4453-8c56-2c480b4dea98,
44aaf175-f371-4dee-9e06-e484fe0e47ed, 4678a454-c8c1-4485-8780-bec84528ca62,
2a3a6591-2f2b-4f7a-b02d-c31b16b8aa6d, 90c0446c-ad67-4cd7-badb-44d944b068ec,
d3ac0e36-e631-4c19-8a0c-65bd933dd985, 0f13528a-509e-472c-8a05-65a384cbff52,
a302ee03-111a-499b-b090-1d2bf9f2c6cc, f7ecaea0-2ec4-4260-9603-8c0789a92a94,
9a8a4150-49ef-4b5b-b6a5-6f4931adaddc, ae7cfab9-34c8-4a24-8105-1b9820e7614f,
9bb3a6cd-900f-496e-b0ab-df08d2b4b0de, c1e4ae9b-2c2b-4aff-8503-733ad127d63c,
c36eb7dc-db09-4238-afe6-3cbef603feaf, 74e3ab66-dffd-4717-aaca-ddb731dc77b0,
1c0ac9c2-cbc5-4091-9706-dd07259c1cf0, f8ac6957-9232-4190-9bdf-9c54bbc4cacb,
ae4070b0-cd16-4a30-aefa-4b3b6286fa10, c29bfc50-1e1b-4d4a-bd59-b35ce54ca00d,
52886c04-9f19-4a7f-a512-a91835c11afb, 4ba28d00-4415-4f17-ac7a-6c8145e444a6,
32d460a8-c47d-4e30-888a-f9ca7c5d0fa4, 2c08987e-e22c-4243-8950-2c6482f36ad1,
7ceb7755-4597-4ee2-a29b-49b114c6137d, 78b37939-8917-42a2-8b6b-a0e10e0ebd15,
7f96eb82-1c17-45d1-a9d7-5a592ecb069e, a95afa33-8060-4872-9d69-97022cd83c70,
f38a66ac-ed39-4f1a-bc68-528a6283c377, e6f43a4b-12be-4b8c-a108-baa385063729,
84bd0196-fdbb-4c10-86ed-9d4412057835, f627c4bc-fecc-4293-9218-f5ca13e5c841,
b75a22fa-bead-49ae-8a0d-38aa06ccc683, 201cea0f-8b07-4b22-9b92-ff338c3de827,
787117ee-544b-4519-b35a-a7264c07bc1e, 3998dc72-b2fc-4338-b7a8-94fbd704c1b3,
8e11d28a-30a1-40c4-a34e-5ea8128ae099, 761270ac-5b5e-45fe-af43-8d4252c396b3,
01899710-287f-4c15-b567-58281c69857a, 1085f698-3c03-42d1-96af-816c1eb9eb4e,
76e0979b-40d5-4bf8-a8fd-f623c8a0e200, e77e58e5-6f30-41a0-b5a5-5b59405353cc,
ce5b8430-8061-406a-ae29-ced20d5a21f4, 5492c340-1441-429a-9a01-9741a84c5e2d,
3980e286-125b-4370-a785-0bf229ab223a, 1b63cbca-b944-4511-b5f8-f7205f8e8377,
f7c8ee22-f18a-4d5e-a500-40754786cab9, f2deea1b-282d-4a1f-9edc-b07bdd6f478e,
cd736923-0978-4ecb-9a1a-ebf87d760bd6, 73d3bfd4-0eb2-40d6-8f47-a65b4adc5cbc,
28498f9a-983c-4683-b588-dc37e15ef84c, 3b436182-790f-48bf-bb07-0dd3f3ef013e,
d96425ef-15e8-437f-8402-8273f0b49c7b, f492bab3-986c-4d51-aa00-99f71ff87660,
454c3ac1-1946-4289-921b-33868d278ac3, b0521958-7fbc-464d-a135-601fd78a0cdf,
9a691f5e-cb6d-46a0-b99d-ea0aee88c2c4, 0383e4bb-c560-4e54-a5c0-7a8042a9c983,
42566a31-ee1d-4abf-ad42-78ba9f1bc16c, e300013c-df6e-4bba-bc9a-b83441cb6b59,
6b189281-fb50-4a7a-b007-1365a535dd8a, 44a8749e-aa3e-42de-8d77-1fb7beb746a5,
e13a3446-b45a-4f31-9740-f8a75ac3fd78, 538cb865-05a5-477b-97fb-6ad4bc72149b,
72228a4b-6059-4f9c-b9c8-2319e2461708, bf51c304-6b3d-4f25-bdfe-98b5f0bf5176,
52fade64-efc7-4b8d-bda2-017b3aa53f42, 0b16dec0-ec90-46f8-ac71-9c5bb833e5e1,
b31e02df-91d3-4308-adf0-1a25e3b82642, 807d4d47-87b4-43ac-9bbf-e31745cbd5f2,
c72a35e4-021c-483c-b8ea-81e8099eabce], srvNodes=[TcpDiscoveryNode
[id=e13a3446-b45a-4f31-9740-f8a75ac3fd78, addrs=[10.194.70.60, 127.0.0.1],
sockAddrs=[zzz/yyy:47500, /xxx:47500, /127.0.0.1:47500], discPort=47500,
order=876, intOrder=459, lastExchangeTime=1472826122976, loc=false,
ver=1.6.0#20160518-sha1:0b22c45b, isClient=false], TcpDiscoveryNode
[id=e77e58e5-6f30-41a0-b5a5-5b59405353cc, addrs=[10.194.78.21, 127.0.0.1],
sockAddrs=[yyyy.cm10/mm:47500, /10.194.78.21:47500, /127.0.0.1:47500],
discPort=47500, order=877, intOrder=460, lastExchangeTime=1472826122976,
loc=false, ver=1.6.0#20160518-sha1:0b22c45b, isClient=false],
TcpDiscoveryNode [id=6b189281-fb50-4a7a-b007-1365a535dd8a,
addrs=[10.194.62.38, 127.0.0.1], sockAddrs=[xxx.cm10/xxx:47500, /yy:47500,
/127.0.0.1:47500], discPort=47500, order=878, intOrder=461,
lastExchangeTime=1472826122986, loc=false, ver=1.6.0#20160518-sha1:0b22c45b,
isClient=false], TcpDiscoveryNode [id=3998dc72-b2fc-4338-b7a8-94fbd704c1b3,
addrs=[10.194.54.71, 127.0.0.1], 
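The "Still waiting for initial partition map exchange" warning lists, under `remaining=[...]`, the IDs of the nodes this exchange appears to still be waiting on; comparing that list with the topology snapshot (`servers=92`) gives a rough idea of how much of the cluster has not responded. A small sketch for pulling those IDs out of a log excerpt, assuming the lowercase 8-4-4-4-12 UUID format seen in the log; `RemainingNodes` and `remainingNodeIds` are hypothetical helper names, not part of Ignite:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemainingNodes {

    // Captures everything between "remaining=[" and the next "]".
    // [^\]] also matches line breaks, so lists wrapped across lines are handled.
    private static final Pattern REMAINING = Pattern.compile("remaining=\\[([^\\]]*)\\]");

    // A lowercase UUID in the 8-4-4-4-12 layout used for node IDs in the log.
    private static final Pattern NODE_ID = Pattern.compile(
            "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}");

    /** Returns the node IDs in the first remaining=[...] list, or an empty list if none. */
    public static List<String> remainingNodeIds(String logText) {
        List<String> ids = new ArrayList<String>();
        Matcher list = REMAINING.matcher(logText);
        if (!list.find()) {
            return ids;
        }
        Matcher id = NODE_ID.matcher(list.group(1));
        while (id.find()) {
            ids.add(id.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the warning (most IDs omitted).
        String excerpt = "crd=TcpDiscoveryNode [id=e13a3446-b45a-4f31-9740-f8a75ac3fd78], "
                + "remaining=[c8c7f8c4-3db4-4215-ba1c-dc45502d6327,\n"
                + "4aaca392-4985-4f4d-8ea3-66281584be36, e13a3446-b45a-4f31-9740-f8a75ac3fd78], "
                + "srvNodes=[...]";
        String coordinator = "e13a3446-b45a-4f31-9740-f8a75ac3fd78";
        List<String> ids = remainingNodeIds(excerpt);
        System.out.println(ids.size() + " node IDs listed under remaining=");
        System.out.println("coordinator listed: " + ids.contains(coordinator));
    }
}
```

Applied to the full warning, the same check shows that the coordinator's own ID (e13a3446-b45a-4f31-9740-f8a75ac3fd78, from the `crd=` field) is itself among the remaining nodes.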

Can anyone help me with this? Thanks!



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Spark-stage-was-hang-via-ignite-tp7485.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Spark stage was hang via ignite

Posted by vkulichenko <va...@gmail.com>.
Hi,

I can't tell anything without the full logs. Please upload them somewhere if
you still need our help.

-Val




Re: Spark stage was hang via ignite

Posted by percent620 <pe...@163.com>.
We have two executors that are running very slowly...

<http://apache-ignite-users.70518.x6.nabble.com/file/n7525/ignite.png>






Re: Spark stage was hang via ignite

Posted by vkulichenko <va...@gmail.com>.
Can you attach the full logs from all Ignite nodes? Please don't copy-paste them into the message; use the attach function or upload them somewhere.

-Val


