Posted to user@phoenix.apache.org by Batyrshin Alexander <0x...@gmail.com> on 2018/09/20 17:19:37 UTC

Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

 Hello,
Looks like we got a dead lock with a repeating "ERROR 1120 (XCL20)" exception. At this time all indexes are ACTIVE.
Can you help us make a deeper diagnosis?
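
(We verify the index state with a query like the following from sqlline; a sketch, assuming the standard Phoenix 4.x SYSTEM.CATALOG columns:)

SELECT TABLE_NAME, DATA_TABLE_NAME, INDEX_STATE, INDEX_DISABLE_TIMESTAMP
FROM SYSTEM.CATALOG
WHERE INDEX_STATE IS NOT NULL;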

java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
	at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
	at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
	at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
	at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
	at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
	at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
	at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
	at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
	at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
	at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
	at scala.util.Try$.apply(Try.scala:209)
	at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
	at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
	at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
	at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
	at scala.collection.immutable.Stream.length(Stream.scala:309)
	at scala.collection.SeqLike.size(SeqLike.scala:105)
	at scala.collection.SeqLike.size$(SeqLike.scala:105)
	at scala.collection.AbstractSeq.size(Seq.scala:41)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
	at scala.util.Try$.apply(Try.scala:209)
	at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
	at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
	at scala.util.Success.$anonfun$map$1(Try.scala:251)
	at scala.util.Success.map(Try.scala:209)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Yep, we tried restarting the Master + all Region Servers.
hbase.hregion.max.filesize is just 10GB. Our CPU/Disk are busy at 5% or even less.


Just got another split:

Oct 01 17:24:09 prod001 hbase[18135]: 2018-10-01 17:24:09,964 INFO  [prod001,60000,1538342034697_ChoreService_6] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 18.168898456852713, sum multiplier is 1102.0 min cost which need balance is 0.0
Oct 01 17:28:00 prod001 hbase[18135]: 2018-10-01 17:28:00,362 INFO  [Idle-Rpc-Conn-Sweeper-pool2-t1] ipc.AbstractRpcClient: Cleanup idle connection to prod017/10.0.0.17:60020
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,252 INFO  [AM.ZK.Worker-pool5-t175] master.RegionStates: Transition null to {3bf96cfad6bb33b6f8a0db6a242c2577 state=SPLITTING_NEW, ts=1538404088252, server=prod014,60020,1538344014594}
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,252 INFO  [AM.ZK.Worker-pool5-t175] master.RegionStates: Transition null to {8669f928e67ee67e567400e7c011727f state=SPLITTING_NEW, ts=1538404088252, server=prod014,60020,1538344014594}
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,252 INFO  [AM.ZK.Worker-pool5-t175] master.RegionStates: Transition {8277aec30576531134e36bbde39e372d state=OPEN, ts=1538344020332, server=prod014,60020,1538344014594} to {8277aec30576531134e36bbde39e372d state=SPLITTING, ts=153840
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,677 INFO  [AM.ZK.Worker-pool5-t177] master.RegionStates: Transition {8277aec30576531134e36bbde39e372d state=SPLITTING, ts=1538404088677, server=prod014,60020,1538344014594} to {8277aec30576531134e36bbde39e372d state=SPLIT, ts=15384
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,677 INFO  [AM.ZK.Worker-pool5-t177] master.RegionStates: Offlined 8277aec30576531134e36bbde39e372d from prod014,60020,1538344014594
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,677 INFO  [AM.ZK.Worker-pool5-t177] master.RegionStates: Transition {3bf96cfad6bb33b6f8a0db6a242c2577 state=SPLITTING_NEW, ts=1538404088677, server=prod014,60020,1538344014594} to {3bf96cfad6bb33b6f8a0db6a242c2577 state=OPEN, ts=15
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,677 INFO  [AM.ZK.Worker-pool5-t177] master.RegionStates: Transition {8669f928e67ee67e567400e7c011727f state=SPLITTING_NEW, ts=1538404088677, server=prod014,60020,1538344014594} to {8669f928e67ee67e567400e7c011727f state=OPEN, ts=15
Oct 01 17:28:08 prod001 hbase[18135]: 2018-10-01 17:28:08,678 INFO  [AM.ZK.Worker-pool5-t177] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x06\x0000000046200020r:N,8ef\x00\x00\x80\x00\x01d\xEBcIp\x00\x00\x00\x00,1537400027657.8277aec30576531134e36bbde39e372d.
Oct 01 17:29:10 prod001 hbase[18135]: 2018-10-01 17:29:10,039 INFO  [prod001,60000,1538342034697_ChoreService_7] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 13.420884661878974, sum multiplier is 1102.0 min cost which need balance is 0.0
Oct 01 17:32:00 prod001 hbase[18135]: 2018-10-01 17:32:00,363 INFO  [Idle-Rpc-Conn-Sweeper-pool2-t1] ipc.AbstractRpcClient: Cleanup idle connection to prod017/10.0.0.17:60020
Oct 01 17:34:09 prod001 hbase[18135]: 2018-10-01 17:34:09,957 INFO  [prod001,60000,1538342034697_ChoreService_1] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 13.623502266311172, sum multiplier is 1102.0 min cost which need balance is 0.0
Oct 01 17:34:10 prod001 hbase[18135]: 2018-10-01 17:34:10,059 INFO  [prod001,60000,1538342034697_ChoreService_2] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x06\x0000000046200020r:N,8ef\x00\x00\x80\x00\x01d\xEBcIp\x00\x00\x00\x00,1537400027657.8277aec30576531134e36bbde39e372d.
Oct 01 17:34:10 prod001 hbase[18135]: 2018-10-01 17:34:10,060 INFO  [prod001,60000,1538342034697_ChoreService_2] master.CatalogJanitor: Scanned 791 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)


Master start time:
HMaster Start Time	Mon Oct 01 00:13:54 MSK 2018

RServer start time:
RS Start Time	Mon Oct 01 00:46:54 MSK 2018

Per table config:
hbase(main):128:0> describe 'IDX_MARK_O'
Table IDX_MARK_O is ENABLED
IDX_MARK_O, {TABLE_ATTRIBUTES => {PRIORITY => '1000', coprocessor$1 => '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|', coprocessor$2 => '|org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver|805306366|', coprocessor$3 => '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|', coprocessor$4 => '|org.apache.phoenix.coprocessor.ServerCachingEndpointImpl|805306366|', METADATA => {'DATA_TABLE_NAME' => 'MARK', 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'd', BLOOMFILTER => 'NONE', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DIFF', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.3340 seconds

Per HBase config:
<property>
<name>hbase.regionserver.region.split.policy</name>
<value>
org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy
</value>
<source>hbase-site.xml</source>
</property>
<property>
<name>hbase.hregion.max.filesize</name>
<value>10737418240</value>
<source>hbase-default.xml</source>
</property>
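
Perhaps we should try disabling splits on the index table outright with DisabledRegionSplitPolicy (a sketch; DisabledRegionSplitPolicy ships with HBase, and the table_att form follows the HBase shell documentation):

hbase> alter 'IDX_MARK_O', {METHOD => 'table_att', METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy'}}
hbase> alter 'IDX_MARK_O', MAX_FILESIZE => '107374182400'

The second command sets a deliberately huge per-table MAX_FILESIZE (100GB here, an arbitrary illustration) so the size-based check never fires even if the policy attribute is not picked up.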



> On 30 Sep 2018, at 06:09, Jaanai Zhang <cl...@gmail.com> wrote:
> 
> Did you restart the cluster? And you should set 'hbase.hregion.max.filesize' to a safeguard value which is less than the RS's capabilities.
> 
> ----------------------------------------
>    Jaanai Zhang
>    Best regards!
> 
> 
> 
> Batyrshin Alexander <0x62ash@gmail.com> wrote on Sat, 29 Sep 2018 at 17:28:
> Meanwhile we tried to disable region splits via per-index-table options
> 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'  and hbase.hregion.max.filesize = 10737418240
> Looks like this option set doesn't work. Some regions split at sizes < 2GB
> 
> Then we tried to disable all splits via hbase shell: splitormerge_switch 'SPLIT', false
> Seems that this also doesn't work.
> 
> Any ideas why we can't disable region splits?
> 
>> On 27 Sep 2018, at 02:52, Vincent Poon <vincent.poon.us@gmail.com> wrote:
>> 
>> We are planning a Phoenix 4.14.1 release which will have this fix
>> 
>> On Wed, Sep 26, 2018 at 3:36 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
>> Thank you. We will try somehow...
>> Is there any chance that this fix will be included in the next release for HBase 1.4 (not 2.0)?
>> 
>>> On 27 Sep 2018, at 01:04, Ankit Singhal <ankitsinghal59@gmail.com> wrote:
>>> 
>>> You might be hitting PHOENIX-4785 <https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the patch on top of 4.14 and see if it fixes your problem.
>>> 
>>> Regards,
>>> Ankit Singhal
>>> 
>>> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>> Any advice? Can anyone help?
>>> I can reproduce the problem and capture more logs if needed.
>>> 
>>>> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>> 
>>>> Looks like the lock goes away 30 minutes after the index region split.
>>>> So I can assume that this issue comes from the cache configured by this option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>>>> 
>>>> 
>>>> 
>>>>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>> 
>>>>> And here is how this split looks in the Master logs:
>>>>> 
>>>>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
>>>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
>>>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
>>>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
>>>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>>>> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO  [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
>>>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
>>>>> 
>>>>>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>> 
>>>>>> Looks like the problem was caused by an index region split
>>>>>> 
>>>>>> Index region split at prod013:
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at 1537461904877
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at 1537461904880
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at 1537461904987
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at 1537461905002
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at 1537461905002
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION at 1537461905056
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION at 1537461905131
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at 1537461905249
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at 1537461905252
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at 1537461905439
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at 1537461905439
>>>>>> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>>>>>> 
>>>>>> Index update failed at prod002:
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Re
>>>>>> gion IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates. indexState=PENDING_DISABLE
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>>>> 
>>>>>> 
>>>>>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>>> 
>>>>>>> Our setup:
>>>>>>> HBase-1.4.7
>>>>>>> Phoenix-4.14-hbase-1.4
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Jaanai Zhang <cl...@gmail.com>.
Did you restart the cluster? And you should set 'hbase.hregion.max.filesize'
to a safeguard value which is less than the RS's capabilities.
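
For example (an illustrative value only; pick one your region servers can comfortably flush and compact):

<property>
<name>hbase.hregion.max.filesize</name>
<value>10737418240</value>
</property>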

----------------------------------------
   Jaanai Zhang
   Best regards!




Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Meanwhile we tried to disable region splits via per-index-table options
'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'  and hbase.hregion.max.filesize = 10737418240
Looks like this option set doesn't work. Some regions split at sizes < 2GB

Then we tried to disable all splits via hbase shell: splitormerge_switch 'SPLIT', false
Seems that this also doesn't work.

Any ideas why we can't disable region splits?
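
One thing worth checking is whether the switch actually persisted; the same HBase 1.4 shell that has splitormerge_switch also has a read-back command (a sketch):

hbase> splitormerge_enabled 'SPLIT'

If it returns false and regions still split, the splits may have been requested before the switch was flipped, since an in-flight split transaction runs to completion.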

>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>>> 
>>>>> 
>>>>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>> 
>>>>>> Our setup:
>>>>>> HBase-1.4.7
>>>>>> Phoenix-4.14-hbase-1.4
>>>>>> 
>>>>>> 
>>>>>>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>>> 
>>>>>>>  Hello,
>>>>>>> Looks like we got a dead lock with repeating "ERROR 1120 (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>>>> Can you help to make a deeper diagnosis?
>>>>>>> 
>>>>>>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
>>>>>>> 	at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>>>> 	at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>>>> 	at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>>>> 	at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>>>> 	at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>>>> 	at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>>>> 	at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>>>> 	at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>>>> 	at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>>>> 	at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>>>> 	at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>>>> 	at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>>>> 	at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>>>> 	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>>>> 	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>>>> 	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>>>> 	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>>>> 	at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>>>> 	at scala.util.Success.map(Try.scala:209)
>>>>>>> 	at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>>>> 	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>>>> 	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>>>> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Vincent Poon <vi...@gmail.com>.
We are planning a Phoenix 4.14.1 release which will have this fix

On Wed, Sep 26, 2018 at 3:36 PM Batyrshin Alexander <0x...@gmail.com>
wrote:

> Thank you. We will try somehow...
> Is there any chance that this fix will be included in the next release for
> HBASE-1.4 (not 2.0)?
>
> On 27 Sep 2018, at 01:04, Ankit Singhal <an...@gmail.com> wrote:
>
> You might be hitting PHOENIX-4785
> <https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the
> patch on top of 4.14 and see if it fixes your problem.
>
> Regards,
> Ankit Singhal
>
> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x...@gmail.com>
> wrote:
>
>> Any advice? Can anyone help?
>> I can reproduce the problem and capture more logs if needed.
>>
>> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x...@gmail.com> wrote:
>>
>> Looks like the lock goes away 30 minutes after the index region split.
>> So I can assume that this issue comes from the cache that is configured by this
>> option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>>
>>
>>
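
The default for phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs is 30 minutes (1800000 ms), which matches the delay observed above. A minimal sketch of the key and value, assuming only hbase-common on the classpath; in a real deployment the property belongs in hbase-site.xml on every region server rather than in client code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class MetaDataCacheTtl {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Default is 1800000 ms (30 minutes); a smaller value lets stale
            // index metadata expire sooner after an index region split.
            conf.setLong("phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs", 300000L);
            System.out.println(conf.get("phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs"));
        }
    }
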
>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x...@gmail.com> wrote:
>>
>> And here is how this split looks in the Master logs:
>>
>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO
>>  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition
>> {3e44b85ddf407da831dbb9a871496986 state=OPEN,
>> ts=1537304859509, server=prod013,60020,1537304282885} to
>> {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888,
>> server=prod
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO
>>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>> {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340,
>> server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986
>> state=SPLIT, ts=1537461905340, server=pro
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO
>>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined
>> 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO
>>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>> {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340,
>> server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb
>> state=OPEN, ts=1537461905341, server=
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO
>>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>> {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340,
>> server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144
>> state=OPEN, ts=1537461905341, server=
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO
>>  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT
>> event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>> daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO
>>  [prod001,60000,1537304851459_ChoreService_2]
>> balancer.StochasticLoadBalancer: Skipping load balancing because balanced
>> cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost
>> which need balance is 0.05
>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO
>>  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor:
>> Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO
>>  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor:
>> Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1
>> unreferenced parent region(s)
>>
>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x...@gmail.com> wrote:
>>
>> Looks like the problem was caused by an index region split
>>
>> Index region split at prod013:
>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO
>>  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>> regionserver.SplitRequest: Region split, hbase:meta updated, and report to
>> master.
>> Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>> new
>> regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144..
>> Split took 0sec
>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO
>>  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>> regionserver.SplitRequest: Split transaction journal:
>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
>> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at
>> 1537461904877
>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at
>> 1537461904877
>> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at
>> 1537461904880
>> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at
>> 1537461904987
>> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at
>> 1537461905002
>> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at
>> 1537461905002
>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION
>> at 1537461905056
>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION
>> at 1537461905131
>> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at
>> 1537461905249
>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at
>> 1537461905252
>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at
>> 1537461905439
>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at
>> 1537461905439
>> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>>
>> Index update failed at prod002:
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN
>>  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220,
>> table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception:
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region
>> IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>> is not online on prod013,60020,1537304282885
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885,
>> tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final
>> failure
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>> zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45
>> connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,
>> 10.0.0.3:2181
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>> zookeeper.ZooKeeper: Initiating client connection, connectString=
>> 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000
>> watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(
>> 10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to
>> server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using
>> SASL (unknown error)
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(
>> 10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to
>> 10.0.0.3/10.0.0.3:2181, initiating session
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(
>> 10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on
>> server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated
>> timeout = 40000
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>> index.PhoenixIndexFailurePolicy: Successfully
>> update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while
>> writing updates. indexState=PENDING_DISABLE
>> Sep 20 20:09:24 prod002 hbase[97285]:
>> org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>>  disableIndexOnFailure=true, Failed to write to multiple index tables:
>> [IDX_MARK_O]
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO
>>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>> util.IndexManagementUtil:
>> Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121
>> (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to
>> write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>
>>
>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x...@gmail.com> wrote:
>>
>> Our setup:
>> HBase-1.4.7
>> Phoenix-4.14-hbase-1.4
>>
>>
>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x...@gmail.com> wrote:
>>
>>  Hello,
>> Looks like we got a dead lock with repeating "ERROR 1120 (XCL20)"
>> exception. At this time all indexes are ACTIVE.
>> Can you help to make a deeper diagnosis?
>>
>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until
>> index can be updated. tableName=TBL_MARK
>> at
>> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>> at
>> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>> at
>> org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>> at
>> org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>> at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>> at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>> at
>> org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>> at
>> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>> at
>> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>> at
>> org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>> at
>> x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>> at scala.util.Try$.apply(Try.scala:209)
>> at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>> at
>> x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>> at
>> x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>> at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>> at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>> at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>> at scala.collection.immutable.Stream.length(Stream.scala:309)
>> at scala.collection.SeqLike.size(SeqLike.scala:105)
>> at scala.collection.SeqLike.size$(SeqLike.scala:105)
>> at scala.collection.AbstractSeq.size(Seq.scala:41)
>> at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>> at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>> at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>> at
>> x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>> at scala.util.Try$.apply(Try.scala:209)
>> at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>> at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>> at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>> at scala.util.Success.$anonfun$map$1(Try.scala:251)
>> at scala.util.Success.map(Try.scala:209)
>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>> at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>> at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Thank you. We will try somehow...
Is there any chance that this fix will be included in the next release for HBASE-1.4 (not 2.0)?

> On 27 Sep 2018, at 01:04, Ankit Singhal <an...@gmail.com> wrote:
> 
> You might be hitting PHOENIX-4785 <https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the patch on top of 4.14 and see if it fixes your problem.
> 
> Regards,
> Ankit Singhal
> 
> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
> Any advice? Can anyone help?
> I can reproduce the problem and capture more logs if needed.
> 
>> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>> 
>> Looks like the lock goes away 30 minutes after the index region split.
>> So I can assume that this issue comes from the cache that is configured by this option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>> 
>> 
>> 
>>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>> 
>>> And here is how this split looks in the Master logs:
>>> 
>>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO  [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
>>> 
>>>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>> 
>>>> Looks like the problem was caused by an index region split
>>>> 
>>>> Index region split at prod013:
>>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
>>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at 1537461904877
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at 1537461904880
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at 1537461904987
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at 1537461905002
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at 1537461905002
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION at 1537461905056
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION at 1537461905131
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at 1537461905249
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at 1537461905252
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at 1537461905439
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at 1537461905439
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>>>> 
>>>> Index update failed at prod002:
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates. indexState=PENDING_DISABLE
>>>> Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>> 
>>>> 
>>>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>> 
>>>>> Our setup:
>>>>> HBase-1.4.7
>>>>> Phoenix-4.14-hbase-1.4
>>>>> 
>>>>> 
>>>>>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>> 
>>>>>>  Hello,
>>>>>> Looks like we got a dead lock with repeating "ERROR 1120 (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>>> Can you help to make a deeper diagnosis?
>>>>>> 
>>>>>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
>>>>>> 	at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>>> 	at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>>> 	at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>>> 	at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>>> 	at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>>> 	at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>>> 	at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>>> 	at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>>> 	at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>>> 	at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>>> 	at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>>> 	at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>>> 	at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>>> 	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>>> 	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>>> 	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>>> 	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>>> 	at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>>> 	at scala.util.Success.map(Try.scala:209)
>>>>>> 	at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>>> 	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>>> 	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>>> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by chinogitano <mi...@yahoo.com>.
Is there any plan to port the fix to the CDH branch?  The latest is still
4.14.0-cdh5.14.2 (09/jun/2018).

Thanks,
Miles




--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/

Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
I've created a bug with reproduction steps: https://issues.apache.org/jira/browse/PHOENIX-4960

> On 3 Oct 2018, at 21:06, Batyrshin Alexander <0x...@gmail.com> wrote:
> 
> But we see that in our case Phoenix commit() fails with "ERROR 1120 (XCL20): Writes to table blocked until index can be updated" because of org.apache.hadoop.hbase.NotServingRegionException.
> We expected that commit() would retry and eventually succeed.
> 
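
Since the XCL20 block clears once the index becomes writable again, a client can work around the missing retry itself. A minimal sketch in plain JDBC; the helper name and backoff policy are illustrative, not a Phoenix API:

    import java.sql.Connection;
    import java.sql.SQLException;

    public class CommitRetry {
        // Retry commit() while Phoenix raises ERROR 1120 (XCL20); rethrow
        // anything else, or the last XCL20 once maxAttempts is exhausted.
        static void commitWithRetry(Connection conn, int maxAttempts)
                throws SQLException, InterruptedException {
            for (int attempt = 1; ; attempt++) {
                try {
                    conn.commit();
                    return;
                } catch (SQLException e) {
                    if (e.getErrorCode() != 1120 || attempt >= maxAttempts) {
                        throw e;
                    }
                    Thread.sleep(1000L * attempt); // linear backoff between attempts
                }
            }
        }
    }
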
>> On 2 Oct 2018, at 22:02, Josh Elser <elserj@apache.org> wrote:
>> 
>> HBase will invalidate the location of a Region on seeing certain exceptions (including NotServingRegionException). After it sees the exception you have copied below, it should re-fetch the location of the Region.
>> 
>> If HBase keeps trying to access a Region on a RS that isn't hosting it, either hbase:meta is wrong or the HBase client has a bug.
>> 
>> However, to the point here, if that region was split successfully, clients should not be reading from that region anymore -- they would read from the daughters of that split region.
>> 
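
For a client that cannot wait for the automatic re-fetch, the cached locations can also be dropped by hand. A sketch under the assumption that using the private ClusterConnection API of HBase 1.x is acceptable; clearRegionCache is not public client API and may change between releases:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.ClusterConnection;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class DropCachedLocations {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                    ConnectionFactory.createConnection(HBaseConfiguration.create())) {
                // Drop the cached region locations for the index table so the
                // next write re-reads hbase:meta instead of reusing the location
                // of the pre-split parent region.
                ((ClusterConnection) conn).clearRegionCache(TableName.valueOf("IDX_MARK_O"));
            }
        }
    }
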
>> On 10/2/18 2:34 PM, Batyrshin Alexander wrote:
>>> We tried branch 4.14-HBase-1.4 at commit https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8
>>> Is there any way to invalidate the meta-cache when an index region splits? Maybe there is some option to set a max time to live for the cache?
>>> We are watching this on the region servers:
>>> At 09:34 region 96c3ede1c40c98959e60bd6fc0e07269 split on prod019
>>> Oct 02 09:34:39 prod019 hbase[152127]: 2018-10-02 09:34:39,719 INFO   [regionserver/prod019/10.0.0.19:60020-splits-1538462079117] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x
>>> 01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269., new regions: IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1538462079161.80fc2516619d8665789b0c5a2bca8a8b., IDX_MARK_O,\x0BON_SCHFDOPPR_2AL-5602
>>> 2B7D-2F90-4AA5-8125-4F4001B5BE0D-00000_2AL-C0D76C01-EE7E-496B-BCD6-F6488956F75A-00000_20180228_7E372181-F23D-4EBE-9CAD-5F5218C9798I\x0000000046186195_5.UHQ=\x00\x02\x80\x00\x01a\xD3\xEA@\x80\x00\x00\x00\x00,1538462079161.24b6675d9e51067a21e58f294a9f816b.. Split took 0sec
>>> Fail at 11:51 on prod018
>>> Oct 02 11:51:13 prod018 hbase[108476]: 2018-10-02 11:51:13,752 WARN   [hconnection-0x4131af19-shared--pool24-t26652] client.AsyncProcess: #164, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269. is not online on prod019,60020,1538417663874
>>> Fail at 13:38 on prod005
>>> Oct 02 13:38:06 prod005 hbase[197079]: 2018-10-02 13:38:06,040 WARN   [hconnection-0x5e744e65-shared--pool8-t31214] client.AsyncProcess: #53, table=IDX_MARK_O, attempt=1/1 failed=11ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269. is not online on prod019,60020,1538417663874
>>>> On 27 Sep 2018, at 01:04, Ankit Singhal <ankitsinghal59@gmail.com> wrote:
>>>> 
>>>> You might be hitting PHOENIX-4785 <https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the patch on top of 4.14 and see if it fixes your problem.
>>>> 
>>>> Regards,
>>>> Ankit Singhal
>>>> 
>>>> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>> 
>>>>    Any advice? Can anyone help?
>>>>    I can reproduce the problem and capture more logs if needed.
>>>> 
>>>>>    On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>> 
>>>>>    Looks like the lock goes away 30 minutes after the index region split.
>>>>>    So I can assume that this issue comes from the cache that is configured
>>>>>    by this option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>>>>> 
>>>>> 
>>>>> 
>>>>>>    On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>> 
>>>>>>    And here is how this split looks in the Master logs:
>>>>>> 
>>>>>>    Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888
>>>>>>    INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition
>>>>>>    {3e44b85ddf407da831dbb9a871496986 state=OPEN,
>>>>>>    ts=1537304859509, server=prod013,60020,1537304282885} to
>>>>>>    {3e44b85ddf407da831dbb9a871496986 state=SPLITTING,
>>>>>>    ts=1537461904888, server=prod
>>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
>>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>>>    {3e44b85ddf407da831dbb9a871496986
>>>>>>    state=SPLITTING, ts=1537461905340,
>>>>>>    server=prod013,60020,1537304282885} to
>>>>>>    {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340,
>>>>>>    server=pro
>>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
>>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined
>>>>>>    3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
>>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>>>    {33cba925c7acb347ac3f5e70e839c3cb
>>>>>>    state=SPLITTING_NEW, ts=1537461905340,
>>>>>>    server=prod013,60020,1537304282885} to
>>>>>>    {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341,
>>>>>>    server=
>>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
>>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>>>    {acb8f16a004a894c8706f6e12cd26144
>>>>>>    state=SPLITTING_NEW, ts=1537461905340,
>>>>>>    server=prod013,60020,1537304282885} to
>>>>>>    {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341,
>>>>>>    server=
>>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343
>>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager:
>>>>>>    Handled SPLIT
>>>>>>    event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>>>>>>    daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>>>>>    Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972
>>>>>>    INFO  [prod001,60000,1537304851459_ChoreService_2]
>>>>>>    balancer.StochasticLoadBalancer: Skipping load balancing because
>>>>>>    balanced cluster; total cost is 17.82282205608522, sum
>>>>>>    multiplier is 1102.0 min cost which need balance is 0.05
>>>>>>    Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021
>>>>>>    INFO  [prod001,60000,1537304851459_ChoreService_1]
>>>>>>    hbase.MetaTableAccessor:
>>>>>>    Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>>>    Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022
>>>>>>    INFO  [prod001,60000,1537304851459_ChoreService_1]
>>>>>>    master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0
>>>>>>    unreferenced merged region(s) and 1 unreferenced parent region(s)
>>>>>> 
>>>>>>>    On 20 Sep 2018, at 21:43, Batyrshin Alexander
>>>>>>>    <0x62ash@gmail.com> wrote:
>>>>>>> 
>>>>>>>    Looks like the problem was caused by an index region split
>>>>>>> 
>>>>>>>    Index region split at prod013:
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441
>>>>>>>    INFO
>>>>>>>     [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>>>>>>>    regionserver.SplitRequest: Region split, hbase:meta updated,
>>>>>>>    and report to master.
>>>>>>>    Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>>>>>>>    new
>>>>>>>    regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144..
>>>>>>>    Split took 0sec
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441
>>>>>>>    INFO
>>>>>>>     [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>>>>>>>    regionserver.SplitRequest: Split transaction journal:
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at
>>>>>>>    1537461904853
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at
>>>>>>>    1537461904877
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            AFTER_PRE_SPLIT_HOOK at 1537461904877
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at
>>>>>>>    1537461904880
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR
>>>>>>>    at 1537461904987
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            CLOSED_PARENT_REGION at 1537461905002
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT
>>>>>>>    at 1537461905002
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            STARTED_REGION_A_CREATION at 1537461905056
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            STARTED_REGION_B_CREATION at 1537461905131
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         PONR at
>>>>>>>    1537461905192
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A
>>>>>>>    at 1537461905249
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B
>>>>>>>    at 1537461905252
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            BEFORE_POST_SPLIT_HOOK at 1537461905439
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            AFTER_POST_SPLIT_HOOK at 1537461905439
>>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at
>>>>>>>    1537461905439
>>>>>>> 
>>>>>>>    Index update failed at prod002:
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520
>>>>>>>    WARN  [hconnection-0x4f3242a0-shared--pool32-t36014]
>>>>>>>    client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1
>>>>>>>    failed=1ops, last exception:
>>>>>>>    org.apache.hadoop.hbase.NotServingRegionException:
>>>>>>>    org.apache.hadoop.hbase.NotServingRegionException: Region
>>>>>>>    IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>>>>    is not online on prod013,60020,1537304282885
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:  on
>>>>>>>    prod013,60020,1537304282885, tracking started Thu Sep 20
>>>>>>>    20:09:24 MSK 2018; not retrying 1 - final failure
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>>    zookeeper.RecoverableZooKeeper: Process
>>>>>>>    identifier=hconnection-0x39beae45 connecting to ZooKeeper
>>>>>>>    ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>>    zookeeper.ZooKeeper: Initiating client
>>>>>>>    connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>>>    sessionTimeout=90000
>>>>>>>    watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>>>    zookeeper.ClientCnxn: Opening socket connection to server
>>>>>>>    10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using
>>>>>>>    SASL (unknown error)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>>>    zookeeper.ClientCnxn: Socket connection established to
>>>>>>>    10.0.0.3/10.0.0.3:2181, initiating session
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>>>    zookeeper.ClientCnxn: Session establishment complete on server
>>>>>>>    10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f,
>>>>>>>    negotiated timeout = 40000
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>>    index.PhoenixIndexFailurePolicy: Successfully
>>>>>>>    update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an
>>>>>>>    exception while writing updates. indexState=PENDING_DISABLE
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:
>>>>>>>    org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>>>>>>>     disableIndexOnFailure=true, Failed to write to multiple index
>>>>>>>    tables: [IDX_MARK_O]
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632
>>>>>>>    INFO
>>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>>    util.IndexManagementUtil:
>>>>>>>    Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR
>>>>>>>    1121 (XCL21): Write to the index failed.
>>>>>>>     disableIndexOnFailure=true, Failed to write to multiple index
>>>>>>>    tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>>>>> 
>>>>>>> 
>>>>>>>>    On 20 Sep 2018, at 21:01, Batyrshin Alexander
>>>>>>>>    <0x62ash@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>    Our setup:
>>>>>>>>    HBase-1.4.7
>>>>>>>>    Phoenix-4.14-hbase-1.4
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>    On 20 Sep 2018, at 20:19, Batyrshin Alexander
>>>>>>>>>    <0x62ash@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>     Hello,
>>>>>>>>>    Looks like we got a dead lock with a repeating "ERROR 1120
>>>>>>>>>    (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>>>>>>    Can you help us make a deeper diagnosis?
>>>>>>>>> 
>>>>>>>>>    java.sql.SQLException: ERROR 1120 (XCL20): Writes to table
>>>>>>>>>    blocked until index can be updated. tableName=TBL_MARK
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>>>>>>    at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>>>>>>    at
>>>>>>>>>    org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>>>>>>    at scala.util.Try$.apply(Try.scala:209)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>>>>>>    at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>>>>>>    at
>>>>>>>>>    scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>>>>>>    at
>>>>>>>>>    scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>>>>>>    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>>>>>>    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>>>>>>    at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>>>>>>    at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>>>>>>    at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>>>>>>    at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>>>>>>    at
>>>>>>>>>    scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>>>>>>    at
>>>>>>>>>    scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>>>>>>    at
>>>>>>>>>    scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>>>>>>    at scala.util.Try$.apply(Try.scala:209)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>>>>>>    at
>>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>>>>>>    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>>>>>>    at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>>>>>>    at scala.util.Success.map(Try.scala:209)
>>>>>>>>>    at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>>>>>>    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>>>>>>    at
>>>>>>>>>    scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>>>>>>    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>>>>>>    at
>>>>>>>>>    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>    at
>>>>>>>>>    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>    at java.lang.Thread.run(Thread.java:748)
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
But we see that in our case Phoenix commit() fails with "ERROR 1120 (XCL20): Writes to table blocked until index can be updated" because of org.apache.hadoop.hbase.NotServingRegionException.
We expected commit() to be retried and to eventually succeed.
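
As a sketch of the retry we expected, here is a minimal, hypothetical client-side helper (our own code, not a Phoenix API) that retries commit() with a linear backoff while Phoenix keeps reporting error 1120. It assumes the connection still holds the pending mutations after a failed commit(); if that does not hold for a given Phoenix version, re-running the upserts before each retry is the safer variant.

    import java.sql.Connection;
    import java.sql.SQLException;

    public class CommitWithRetry {

        // XCL20: "Writes to table blocked until index can be updated"
        private static final int ERROR_1120_XCL20 = 1120;

        static void commitWithRetry(Connection conn, int maxAttempts)
                throws SQLException, InterruptedException {
            for (int attempt = 1; ; attempt++) {
                try {
                    conn.commit();
                    return;                        // commit went through
                } catch (SQLException e) {
                    // Rethrow anything that is not error 1120, and give up
                    // once the attempt budget is spent.
                    if (e.getErrorCode() != ERROR_1120_XCL20 || attempt >= maxAttempts) {
                        throw e;
                    }
                    Thread.sleep(1000L * attempt); // simple linear backoff
                }
            }
        }
    }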

> On 2 Oct 2018, at 22:02, Josh Elser <el...@apache.org> wrote:
> 
> HBase will invalidate the location of a Region on seeing certain exceptions (including NotServingRegionException). After it sees the exception you have copied below, it should re-fetch the location of the Region.
> 
> If HBase keeps trying to access a Region on a RS that isn't hosting it, either hbase:meta is wrong or the HBase client has a bug.
> 
> However, to the point here, if that region was split successfully, clients should not be reading from that region anymore -- they would read from the daughters of that split region.
> 
> On 10/2/18 2:34 PM, Batyrshin Alexander wrote:
>> We tried branch 4.14-HBase-1.4 at commit https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8
>> Is there any way to invalidate the meta-cache when an index region splits? Maybe there is some option to set a maximum time to live for the cache?
>> Watching this on region servers:
>> At 09:34 region *96c3ede1c40c98959e60bd6fc0e07269* split on prod019
>> Oct 02 09:34:39 prod019 hbase[152127]: 2018-10-02 09:34:39,719 INFO   [regionserver/prod019/10.0.0.19:60020-splits-1538462079117] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x
>> 01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*., new regions: IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1538462079161.80fc2516619d8665789b0c5a2bca8a8b., IDX_MARK_O,\x0BON_SCHFDOPPR_2AL-5602
>> 2B7D-2F90-4AA5-8125-4F4001B5BE0D-00000_2AL-C0D76C01-EE7E-496B-BCD6-F6488956F75A-00000_20180228_7E372181-F23D-4EBE-9CAD-5F5218C9798I\x0000000046186195_5.UHQ=\x00\x02\x80\x00\x01a\xD3\xEA@\x80\x00\x00\x00\x00,1538462079161.24b6675d9e51067a21e58f294a9f816b.. Split took 0sec
>> Fail at 11:51 prod018
>> Oct 02 11:51:13 prod018 hbase[108476]: 2018-10-02 11:51:13,752 WARN   [hconnection-0x4131af19-shared--pool24-t26652] client.AsyncProcess: #164, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*. is not online on prod019,60020,1538417663874
>> Fail at 13:38 on prod005
>> Oct 02 13:38:06 prod005 hbase[197079]: 2018-10-02 13:38:06,040 WARN   [hconnection-0x5e744e65-shared--pool8-t31214] client.AsyncProcess: #53, table=IDX_MARK_O, attempt=1/1 failed=11ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*. is not online on prod019,60020,1538417663874
>>> On 27 Sep 2018, at 01:04, Ankit Singhal <ankitsinghal59@gmail.com> wrote:
>>> 
>>> You might be hitting PHOENIX-4785 <https://jira.apache.org/jira/browse/PHOENIX-4785>, you can apply the patch on top of 4.14 and see if it fixes your problem.
>>> 
>>> Regards,
>>> Ankit Singhal
>>> 
>>> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>> 
>>>    Any advice? Any help?
>>>    I can reproduce the problem and capture more logs if needed.
>>> 
>>>>    On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>> 
>>>>    Looks like the lock goes away 30 minutes after the index region split.
>>>>    So I can assume that this issue comes from the cache that is configured
>>>>    by this option: *phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs*
>>>> 
>>>> 
>>>> 
>>>>>    On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>> 
>>>>>    And how this split looks at Master logs:
>>>>> 
>>>>>    Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888
>>>>>    INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition
>>>>>    {3e44b85ddf407da831dbb9a871496986 state=OPEN,
>>>>>    ts=1537304859509, server=prod013,60020,1537304282885} to
>>>>>    {3e44b85ddf407da831dbb9a871496986 state=SPLITTING,
>>>>>    ts=1537461904888, server=prod
>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>>    {3e44b85ddf407da831dbb9a871496986
>>>>>    state=SPLITTING, ts=1537461905340,
>>>>>    server=prod013,60020,1537304282885} to
>>>>>    {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340,
>>>>>    server=pro
>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined
>>>>>    3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>>    {33cba925c7acb347ac3f5e70e839c3cb
>>>>>    state=SPLITTING_NEW, ts=1537461905340,
>>>>>    server=prod013,60020,1537304282885} to
>>>>>    {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341,
>>>>>    server=
>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>>    {acb8f16a004a894c8706f6e12cd26144
>>>>>    state=SPLITTING_NEW, ts=1537461905340,
>>>>>    server=prod013,60020,1537304282885} to
>>>>>    {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341,
>>>>>    server=
>>>>>    Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343
>>>>>    INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager:
>>>>>    Handled SPLIT
>>>>>    event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>>>>>    daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>>>>    Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972
>>>>>    INFO  [prod001,60000,1537304851459_ChoreService_2]
>>>>>    balancer.StochasticLoadBalancer: Skipping load balancing because
>>>>>    balanced cluster; total cost is 17.82282205608522, sum
>>>>>    multiplier is 1102.0 min cost which need balance is 0.05
>>>>>    Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021
>>>>>    INFO  [prod001,60000,1537304851459_ChoreService_1]
>>>>>    hbase.MetaTableAccessor:
>>>>>    Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>>    Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022
>>>>>    INFO  [prod001,60000,1537304851459_ChoreService_1]
>>>>>    master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0
>>>>>    unreferenced merged region(s) and 1 unreferenced parent region(s)
>>>>> 
>>>>>>    On 20 Sep 2018, at 21:43, Batyrshin Alexander
>>>>>>    <0x62ash@gmail.com> wrote:
>>>>>> 
>>>>>>    Looks like problem was because of index region split
>>>>>> 
>>>>>>    Index region split at prod013:
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441
>>>>>>    INFO
>>>>>>     [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>>>>>>    regionserver.SplitRequest: Region split, hbase:meta updated,
>>>>>>    and report to master.
>>>>>>    Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>>>>>>    new
>>>>>>    regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144..
>>>>>>    Split took 0sec
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441
>>>>>>    INFO
>>>>>>     [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>>>>>>    regionserver.SplitRequest: Split transaction journal:
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at
>>>>>>    1537461904853
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at
>>>>>>    1537461904877
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            AFTER_PRE_SPLIT_HOOK at 1537461904877
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at
>>>>>>    1537461904880
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR
>>>>>>    at 1537461904987
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            CLOSED_PARENT_REGION at 1537461905002
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT
>>>>>>    at 1537461905002
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            STARTED_REGION_A_CREATION at 1537461905056
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            STARTED_REGION_B_CREATION at 1537461905131
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         PONR at
>>>>>>    1537461905192
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A
>>>>>>    at 1537461905249
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B
>>>>>>    at 1537461905252
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            BEFORE_POST_SPLIT_HOOK at 1537461905439
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:            AFTER_POST_SPLIT_HOOK at 1537461905439
>>>>>>    Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at
>>>>>>    1537461905439
>>>>>> 
>>>>>>    Index update failed at prod002:
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520
>>>>>>    WARN  [hconnection-0x4f3242a0-shared--pool32-t36014]
>>>>>>    client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1
>>>>>>    failed=1ops, last exception:
>>>>>>    org.apache.hadoop.hbase.NotServingRegionException:
>>>>>>    org.apache.hadoop.hbase.NotServingRegionException: Re
>>>>>>    gion
>>>>>>    IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>>>    is not online on prod013,60020,1537304282885
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:  on
>>>>>>    prod013,60020,1537304282885, tracking started Thu Sep 20
>>>>>>    20:09:24 MSK 2018; not retrying 1 - final failure
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>    zookeeper.RecoverableZooKeeper: Process
>>>>>>    identifier=hconnection-0x39beae45 connecting to ZooKeeper
>>>>>>    ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>    zookeeper.ZooKeeper: Initiating client
>>>>>>    connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>>    sessionTimeout=90000
>>>>>>    watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>>    zookeeper.ClientCnxn: Opening socket connection to server
>>>>>>    10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using
>>>>>>    SASL (unknown error)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>>    zookeeper.ClientCnxn: Socket connection established to
>>>>>>    10.0.0.3/10.0.0.3:2181, initiating session
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>>    zookeeper.ClientCnxn: Session establishment complete on server
>>>>>>    10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f,
>>>>>>    negotiated timeout = 40000
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>    index.PhoenixIndexFailurePolicy: Successfully
>>>>>>    update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an
>>>>>>    exception while writing updates. indexState=PENDING_DISABLE
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:
>>>>>>    org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>>>>>>     disableIndexOnFailure=true, Failed to write to multiple index
>>>>>>    tables: [IDX_MARK_O]
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>>    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>>    Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632
>>>>>>    INFO
>>>>>>     [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>>    util.IndexManagementUtil:
>>>>>>    Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR
>>>>>>    1121 (XCL21): Write to the index failed.
>>>>>>     disableIndexOnFailure=true, Failed to write to multiple index
>>>>>>    tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>>>> 
>>>>>> 
>>>>>>>    On 20 Sep 2018, at 21:01, Batyrshin Alexander
>>>>>>>    <0x62ash@gmail.com> wrote:
>>>>>>> 
>>>>>>>    Our setup:
>>>>>>>    HBase-1.4.7
>>>>>>>    Phoenix-4.14-hbase-1.4
>>>>>>> 
>>>>>>> 
>>>>>>>>    On 20 Sep 2018, at 20:19, Batyrshin Alexander
>>>>>>>>    <0x62ash@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>     Hello,
>>>>>>>>    Looks like we got a dead lock with a repeating "ERROR 1120
>>>>>>>>    (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>>>>>    Can you help us make a deeper diagnosis?
>>>>>>>> 
>>>>>>>>    java.sql.SQLException: ERROR 1120 (XCL20): Writes to table
>>>>>>>>    blocked until index can be updated. tableName=TBL_MARK
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>>>>>    at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>>>>>    at
>>>>>>>>    org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>>>>>    at scala.util.Try$.apply(Try.scala:209)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>>>>>    at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>>>>>    at
>>>>>>>>    scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>>>>>    at
>>>>>>>>    scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>>>>>    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>>>>>    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>>>>>    at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>>>>>    at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>>>>>    at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>>>>>    at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>>>>>    at
>>>>>>>>    scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>>>>>    at
>>>>>>>>    scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>>>>>    at
>>>>>>>>    scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>>>>>    at scala.util.Try$.apply(Try.scala:209)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>>>>>    at
>>>>>>>>    x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>>>>>    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>>>>>    at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>>>>>    at scala.util.Success.map(Try.scala:209)
>>>>>>>>    at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>>>>>    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>>>>>    at
>>>>>>>>    scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>>>>>    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>>>>>    at
>>>>>>>>    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>    at
>>>>>>>>    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>    at java.lang.Thread.run(Thread.java:748)


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Josh Elser <el...@apache.org>.
HBase will invalidate the location of a Region on seeing certain 
exceptions (including NotServingRegionException). After it sees the 
exception you have copied below, it should re-fetch the location of the 
Region.

If HBase keeps trying to access a Region on a RS that isn't hosting it, 
either hbase:meta is wrong or the HBase client has a bug.

However, to the point here, if that region was split successfully, 
clients should not be reading from that region anymore -- they would 
read from the daughters of that split region.
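
One way to tell those two cases apart is to list the regions hbase:meta currently holds for the index table and check whether the stale parent still appears. Here is a rough diagnostic sketch against the HBase 1.x client API (the table name and the encoded region name come from the logs in this thread; treat it as an illustration, not a polished tool):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ListIndexRegions {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Regions as hbase:meta sees them right now. If the pre-split
                // parent (e.g. 96c3ede1c40c98959e60bd6fc0e07269) is still
                // listed, hbase:meta is wrong; if only the daughters appear,
                // the stale location is sitting in a client-side cache.
                for (HRegionInfo region : admin.getTableRegions(TableName.valueOf("IDX_MARK_O"))) {
                    System.out.println(region.getEncodedName() + " -> "
                            + region.getRegionNameAsString());
                }
            }
        }
    }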

On 10/2/18 2:34 PM, Batyrshin Alexander wrote:
> We tried branch 4.14-HBase-1.4 at commit 
> https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8
> 
> Is there any way to invalidate the meta-cache when an index region splits? 
> Maybe there is some option to set a maximum time to live for the cache?
> 
> Watching this on region servers:
> 
> At 09:34 region *96c3ede1c40c98959e60bd6fc0e07269* split on prod019
> 
> Oct 02 09:34:39 prod019 hbase[152127]: 2018-10-02 09:34:39,719 INFO 
>   [regionserver/prod019/10.0.0.19:60020-splits-1538462079117] 
> regionserver.SplitRequest: Region split, hbase:meta updated, and report 
> to master. Parent=IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x
> 01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*., 
> new regions: 
> IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1538462079161.80fc2516619d8665789b0c5a2bca8a8b., 
> IDX_MARK_O,\x0BON_SCHFDOPPR_2AL-5602
> 2B7D-2F90-4AA5-8125-4F4001B5BE0D-00000_2AL-C0D76C01-EE7E-496B-BCD6-F6488956F75A-00000_20180228_7E372181-F23D-4EBE-9CAD-5F5218C9798I\x0000000046186195_5.UHQ=\x00\x02\x80\x00\x01a\xD3\xEA@\x80\x00\x00\x00\x00,1538462079161.24b6675d9e51067a21e58f294a9f816b.. 
> Split took 0sec
> 
> Fail at 11:51 prod018
> 
> Oct 02 11:51:13 prod018 hbase[108476]: 2018-10-02 11:51:13,752 WARN 
>   [hconnection-0x4131af19-shared--pool24-t26652] client.AsyncProcess: 
> #164, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: 
> org.apache.hadoop.hbase.NotServingRegionException: 
> org.apache.hadoop.hbase.NotServingRegionException: 
> Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*. 
> is not online on prod019,60020,1538417663874
> 
> Fail at 13:38 on prod005
> 
> Oct 02 13:38:06 prod005 hbase[197079]: 2018-10-02 13:38:06,040 WARN 
>   [hconnection-0x5e744e65-shared--pool8-t31214] client.AsyncProcess: 
> #53, table=IDX_MARK_O, attempt=1/1 failed=11ops, last exception: 
> org.apache.hadoop.hbase.NotServingRegionException: 
> org.apache.hadoop.hbase.NotServingRegionException: 
> Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*. 
> is not online on prod019,60020,1538417663874
> 
>> On 27 Sep 2018, at 01:04, Ankit Singhal <ankitsinghal59@gmail.com> wrote:
>>
>> You might be hitting PHOENIX-4785 
>> <https://jira.apache.org/jira/browse/PHOENIX-4785>,  you can apply the 
>> patch on top of 4.14 and see if it fixes your problem.
>>
>> Regards,
>> Ankit Singhal
>>
>> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>
>>     Any advice? Any help?
>>     I can reproduce the problem and capture more logs if needed.
>>
>>>     On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>
>>>     Looks like the lock goes away 30 minutes after the index region split.
>>>     So I can assume that this issue comes from the cache that is configured
>>>     by this option: *phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs*
>>>
>>>
>>>
>>>>     On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>
>>>>     And how this split looks at Master logs:
>>>>
>>>>     Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888
>>>>     INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition
>>>>     {3e44b85ddf407da831dbb9a871496986 state=OPEN,
>>>>     ts=1537304859509, server=prod013,60020,1537304282885} to
>>>>     {3e44b85ddf407da831dbb9a871496986 state=SPLITTING,
>>>>     ts=1537461904888, server=prod
>>>>     Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
>>>>     INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>     {3e44b85ddf407da831dbb9a871496986
>>>>     state=SPLITTING, ts=1537461905340,
>>>>     server=prod013,60020,1537304282885} to
>>>>     {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340,
>>>>     server=pro
>>>>     Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
>>>>     INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined
>>>>     3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>>>     Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
>>>>     INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>     {33cba925c7acb347ac3f5e70e839c3cb
>>>>     state=SPLITTING_NEW, ts=1537461905340,
>>>>     server=prod013,60020,1537304282885} to
>>>>     {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341,
>>>>     server=
>>>>     Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
>>>>     INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
>>>>     {acb8f16a004a894c8706f6e12cd26144
>>>>     state=SPLITTING_NEW, ts=1537461905340,
>>>>     server=prod013,60020,1537304282885} to
>>>>     {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341,
>>>>     server=
>>>>     Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343
>>>>     INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager:
>>>>     Handled SPLIT
>>>>     event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>>>>     daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>>>     Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972
>>>>     INFO  [prod001,60000,1537304851459_ChoreService_2]
>>>>     balancer.StochasticLoadBalancer: Skipping load balancing because
>>>>     balanced cluster; total cost is 17.82282205608522, sum
>>>>     multiplier is 1102.0 min cost which need balance is 0.05
>>>>     Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021
>>>>     INFO  [prod001,60000,1537304851459_ChoreService_1]
>>>>     hbase.MetaTableAccessor:
>>>>     Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>     Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022
>>>>     INFO  [prod001,60000,1537304851459_ChoreService_1]
>>>>     master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0
>>>>     unreferenced merged region(s) and 1 unreferenced parent region(s)
>>>>
>>>>>     On 20 Sep 2018, at 21:43, Batyrshin Alexander
>>>>>     <0x62ash@gmail.com> wrote:
>>>>>
>>>>>     Looks like problem was because of index region split
>>>>>
>>>>>     Index region split at prod013:
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441
>>>>>     INFO
>>>>>      [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>>>>>     regionserver.SplitRequest: Region split, hbase:meta updated,
>>>>>     and report to master.
>>>>>     Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
>>>>>     new
>>>>>     regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144..
>>>>>     Split took 0sec
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441
>>>>>     INFO
>>>>>      [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
>>>>>     regionserver.SplitRequest: Split transaction journal:
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at
>>>>>     1537461904853
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at
>>>>>     1537461904877
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     AFTER_PRE_SPLIT_HOOK at 1537461904877
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at
>>>>>     1537461904880
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR
>>>>>     at 1537461904987
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     CLOSED_PARENT_REGION at 1537461905002
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT
>>>>>     at 1537461905002
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     STARTED_REGION_A_CREATION at 1537461905056
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     STARTED_REGION_B_CREATION at 1537461905131
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         PONR at
>>>>>     1537461905192
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A
>>>>>     at 1537461905249
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B
>>>>>     at 1537461905252
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     BEFORE_POST_SPLIT_HOOK at 1537461905439
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:        
>>>>>     AFTER_POST_SPLIT_HOOK at 1537461905439
>>>>>     Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at
>>>>>     1537461905439
>>>>>
>>>>>     Index update failed at prod002:
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520
>>>>>     WARN  [hconnection-0x4f3242a0-shared--pool32-t36014]
>>>>>     client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1
>>>>>     failed=1ops, last exception:
>>>>>     org.apache.hadoop.hbase.NotServingRegionException:
>>>>>     org.apache.hadoop.hbase.NotServingRegionException: Re
>>>>>     gion
>>>>>     IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>>>>     is not online on prod013,60020,1537304282885
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:  on
>>>>>     prod013,60020,1537304282885, tracking started Thu Sep 20
>>>>>     20:09:24 MSK 2018; not retrying 1 - final failure
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>     zookeeper.RecoverableZooKeeper: Process
>>>>>     identifier=hconnection-0x39beae45 connecting to ZooKeeper
>>>>>     ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>     zookeeper.ZooKeeper: Initiating client
>>>>>     connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>>>     sessionTimeout=90000
>>>>>     watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>     zookeeper.ClientCnxn: Opening socket connection to server
>>>>>     10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using
>>>>>     SASL (unknown error)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>     zookeeper.ClientCnxn: Socket connection established to
>>>>>     10.0.0.3/10.0.0.3:2181, initiating session
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)]
>>>>>     zookeeper.ClientCnxn: Session establishment complete on server
>>>>>     10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f,
>>>>>     negotiated timeout = 40000
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>     index.PhoenixIndexFailurePolicy: Successfully
>>>>>     update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an
>>>>>     exception while writing updates. indexState=PENDING_DISABLE
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:
>>>>>     org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>>>>>      disableIndexOnFailure=true, Failed to write to multiple index
>>>>>     tables: [IDX_MARK_O]
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]:         at
>>>>>     org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>>>     Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632
>>>>>     INFO
>>>>>      [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
>>>>>     util.IndexManagementUtil:
>>>>>     Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR
>>>>>     1121 (XCL21): Write to the index failed.
>>>>>      disableIndexOnFailure=true, Failed to write to multiple index
>>>>>     tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>>>
>>>>>
>>>>>>     On 20 Sep 2018, at 21:01, Batyrshin Alexander
>>>>>>     <0x62ash@gmail.com> wrote:
>>>>>>
>>>>>>     Our setup:
>>>>>>     HBase-1.4.7
>>>>>>     Phoenix-4.14-hbase-1.4
>>>>>>
>>>>>>
>>>>>>>     On 20 Sep 2018, at 20:19, Batyrshin Alexander
>>>>>>>     <0x62ash@gmail.com> wrote:
>>>>>>>
>>>>>>>      Hello,
>>>>>>>     Looks like we got a dead lock with a repeating "ERROR 1120
>>>>>>>     (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>>>>     Can you help us make a deeper diagnosis?
>>>>>>>
>>>>>>>     java.sql.SQLException: ERROR 1120 (XCL20): Writes to table
>>>>>>>     blocked until index can be updated. tableName=TBL_MARK
>>>>>>>     at
>>>>>>>     org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>>>>     at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>>>>     at
>>>>>>>     org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>>>>     at scala.util.Try$.apply(Try.scala:209)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>>>>     at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>>>>     at
>>>>>>>     scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>>>>     at
>>>>>>>     scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>>>>     at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>>>>     at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>>>>     at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>>>>     at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>>>>     at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>>>>     at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>>>>     at
>>>>>>>     scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>>>>     at
>>>>>>>     scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>>>>     at
>>>>>>>     scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>>>>     at scala.util.Try$.apply(Try.scala:209)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>>>>     at
>>>>>>>     x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>>>>     at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>>>>     at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>>>>     at scala.util.Success.map(Try.scala:209)
>>>>>>>     at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>>>>     at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>>>>     at
>>>>>>>     scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>>>>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>>>>     at
>>>>>>>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>     at
>>>>>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
We tried branch 4.14-HBase-1.4 at commit https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8

Is there any way to invalidate the meta-cache when an index region splits? Maybe there is an option to set a maximum time-to-live for the cache?
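
If it is that cache, I assume we could lower the TTL in hbase-site.xml on the region servers along these lines. This is only a sketch: the 600000 ms (10 minute) value is an arbitrary illustration, and I have not verified that this is the right knob:

    <property>
      <!-- Expire server-side Phoenix metadata cache entries after 10 minutes,
           so that entries pointing at already-split regions age out sooner. -->
      <name>phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs</name>
      <value>600000</value>
    </property>

Presumably the region servers would have to be restarted for the coprocessor to pick this up.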

We are seeing this on the region servers:

At 09:34, region 96c3ede1c40c98959e60bd6fc0e07269 split on prod019:

Oct 02 09:34:39 prod019 hbase[152127]: 2018-10-02 09:34:39,719 INFO  [regionserver/prod019/10.0.0.19:60020-splits-1538462079117] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269., new regions: IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1538462079161.80fc2516619d8665789b0c5a2bca8a8b., IDX_MARK_O,\x0BON_SCHFDOPPR_2AL-56022B7D-2F90-4AA5-8125-4F4001B5BE0D-00000_2AL-C0D76C01-EE7E-496B-BCD6-F6488956F75A-00000_20180228_7E372181-F23D-4EBE-9CAD-5F5218C9798I\x0000000046186195_5.UHQ=\x00\x02\x80\x00\x01a\xD3\xEA@\x80\x00\x00\x00\x00,1538462079161.24b6675d9e51067a21e58f294a9f816b.. Split took 0sec

Failure at 11:51 on prod018:

Oct 02 11:51:13 prod018 hbase[108476]: 2018-10-02 11:51:13,752 WARN  [hconnection-0x4131af19-shared--pool24-t26652] client.AsyncProcess: #164, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269. is not online on prod019,60020,1538417663874

Failure at 13:38 on prod005:

Oct 02 13:38:06 prod005 hbase[197079]: 2018-10-02 13:38:06,040 WARN  [hconnection-0x5e744e65-shared--pool8-t31214] client.AsyncProcess: #53, table=IDX_MARK_O, attempt=1/1 failed=11ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x0B\x0000000046200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269. is not online on prod019,60020,1538417663874
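
Both failures show clients still writing to the already-split parent region 96c3ede1c40c98959e60bd6fc0e07269 hours after the 09:34 split. While writes are blocked we can at least watch the index state from any client. A minimal Scala sketch over the Phoenix JDBC driver (the ZooKeeper quorum in the URL is a placeholder for our ensemble; INDEX_STATE is stored as a one-character code):

    import java.sql.DriverManager

    object IndexStateCheck extends App {
      // Placeholder quorum; point this at the cluster's ZooKeeper ensemble.
      val conn = DriverManager.getConnection("jdbc:phoenix:10.0.0.1,10.0.0.2,10.0.0.3")
      val rs = conn.createStatement().executeQuery(
        "SELECT TABLE_NAME, INDEX_STATE, INDEX_DISABLE_TIMESTAMP " +
        "FROM SYSTEM.CATALOG WHERE INDEX_STATE IS NOT NULL")
      // A PENDING_DISABLE state with a non-null INDEX_DISABLE_TIMESTAMP would
      // match the PhoenixIndexFailurePolicy log lines quoted below.
      while (rs.next())
        println(s"${rs.getString(1)} state=${rs.getString(2)} disableTs=${rs.getLong(3)}")
      conn.close()
    }

Running this in a loop during an incident would show whether the index really sits in a non-ACTIVE state for the whole 30 minutes or flips back earlier.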

> On 27 Sep 2018, at 01:04, Ankit Singhal <an...@gmail.com> wrote:
> 
> You might be hitting PHOENIX-4785 <https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the patch on top of 4.14 and see if it fixes your problem.
> 
> Regards,
> Ankit Singhal
> 
> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62ash@gmail.com> wrote:
> Any advice? Can anyone help?
> I can reproduce the problem and capture more logs if needed.
> 
>> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>> 
>> Looks like the lock goes away 30 minutes after the index region split.
>> So I can assume that this issue comes from the cache configured by this option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>> 
>> 
>> 
>>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>> 
>>> And this is how the split looks in the Master logs:
>>> 
>>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO  [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
>>> 
>>>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>> 
>>>> Looks like the problem was caused by an index region split
>>>> 
>>>> Index region split at prod013:
>>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
>>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at 1537461904877
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at 1537461904880
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at 1537461904987
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at 1537461905002
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at 1537461905002
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION at 1537461905056
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION at 1537461905131
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at 1537461905249
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at 1537461905252
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at 1537461905439
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at 1537461905439
>>>> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>>>> 
>>>> Index update failed at prod002:
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region
>>>> IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates. indexState=PENDING_DISABLE
>>>> Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>> 
>>>> 
>>>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>> 
>>>>> Our setup:
>>>>> HBase-1.4.7
>>>>> Phoenix-4.14-hbase-1.4
>>>>> 
>>>>> 
>>>>>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>>>>> 
>>>>>>  Hello,
>>>>>> Looks like we got a deadlock with a repeating "ERROR 1120 (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>>> Can you help us make a deeper diagnosis?
>>>>>> 
>>>>>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
>>>>>> 	at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>>> 	at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>>> 	at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>>> 	at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>>> 	at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>>> 	at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>>> 	at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>>> 	at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>>> 	at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>>> 	at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>>> 	at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>>> 	at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>>> 	at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>>> 	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>>> 	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>>> 	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>>> 	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>>> 	at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>>> 	at scala.util.Success.map(Try.scala:209)
>>>>>> 	at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>>> 	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>>> 	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>>> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Ankit Singhal <an...@gmail.com>.
You might be hitting PHOENIX-4785
<https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the
patch on top of 4.14 and see if it fixes your problem.
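
If it helps, applying it would look roughly like this, assuming the patch file attached to the JIRA is saved as PHOENIX-4785.patch (the file name and path are illustrative):

    git clone https://github.com/apache/phoenix.git
    cd phoenix
    git checkout 4.14-HBase-1.4          # branch matching your HBase 1.4 setup
    git apply /path/to/PHOENIX-4785.patch
    mvn clean package -DskipTests        # rebuilds the Phoenix jars

and then replace the Phoenix server jar on each region server and restart them.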

Regards,
Ankit Singhal

On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x...@gmail.com>
wrote:

> Any advice? Can anyone help?
> I can reproduce the problem and capture more logs if needed.
>
> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x...@gmail.com> wrote:
>
> Looks like the lock goes away 30 minutes after the index region split.
> So I can assume that this issue comes from the cache configured by this
> option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>
>
>
> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x...@gmail.com> wrote:
>
> And this is how the split looks in the Master logs:
>
> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO
>  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition
> {3e44b85ddf407da831dbb9a871496986 state=OPEN,
> ts=1537304859509, server=prod013,60020,1537304282885} to
> {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888,
> server=prod
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO
>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
> {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340,
> server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986
> state=SPLIT, ts=1537461905340, server=pro
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO
>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined
> 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO
>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
> {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340,
> server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb
> state=OPEN, ts=1537461905341, server=
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO
>  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
> {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340,
> server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144
> state=OPEN, ts=1537461905341, server=
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO
>  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT
> event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
> daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO
>  [prod001,60000,1537304851459_ChoreService_2]
> balancer.StochasticLoadBalancer: Skipping load balancing because balanced
> cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost
> which need balance is 0.05
> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO
>  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor:
> Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO
>  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor:
> Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1
> unreferenced parent region(s)
>
> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x...@gmail.com> wrote:
>
> Looks like the problem was caused by an index region split
>
> Index region split at prod013:
> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO
>  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
> regionserver.SplitRequest: Region split, hbase:meta updated, and report to
> master.
> Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.,
> new
> regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144..
> Split took 0sec
> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO
>  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677]
> regionserver.SplitRequest: Split transaction journal:
> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at
> 1537461904877
> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at
> 1537461904877
> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at
> 1537461904880
> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at
> 1537461904987
> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at
> 1537461905002
> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at
> 1537461905002
> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION
> at 1537461905056
> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION
> at 1537461905131
> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at
> 1537461905249
> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at
> 1537461905252
> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at
> 1537461905439
> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at
> 1537461905439
> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>
> Index update failed at prod002:
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN
>  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220,
> table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception:
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region
> IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
> is not online on prod013,60020,1537304282885
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885,
> tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final
> failure
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
> zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45
> connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
> zookeeper.ZooKeeper: Initiating client connection, connectString=
> 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000
> watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(
> 10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server
> 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL
> (unknown error)
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(
> 10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to
> 10.0.0.3/10.0.0.3:2181, initiating session
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(
> 10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on
> server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated
> timeout = 40000
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
> index.PhoenixIndexFailurePolicy: Successfully
> update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while
> writing updates. indexState=PENDING_DISABLE
> Sep 20 20:09:24 prod002 hbase[97285]:
> org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:
>  disableIndexOnFailure=true, Failed to write to multiple index tables:
> [IDX_MARK_O]
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> Sep 20 20:09:24 prod002 hbase[97285]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO
>  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020]
> util.IndexManagementUtil:
> Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121
> (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to
> write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>
>
> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x...@gmail.com> wrote:
>
> Our setup:
> HBase-1.4.7
> Phoenix-4.14-hbase-1.4
>
>
> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x...@gmail.com> wrote:
>
>  Hello,
> Looks like we got a deadlock with a repeating "ERROR 1120 (XCL20)"
> exception. At this time all indexes are ACTIVE.
> Can you help us make a deeper diagnosis?
>
> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until
> index can be updated. tableName=TBL_MARK
> at
> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
> at
> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
> at
> org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
> at
> org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
> at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
> at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
> at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
> at
> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
> at
> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
> at
> org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
> at
> x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
> at scala.util.Try$.apply(Try.scala:209)
> at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
> at
> x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
> at
> x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
> at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
> at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
> at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
> at scala.collection.immutable.Stream.length(Stream.scala:309)
> at scala.collection.SeqLike.size(SeqLike.scala:105)
> at scala.collection.SeqLike.size$(SeqLike.scala:105)
> at scala.collection.AbstractSeq.size(Seq.scala:41)
> at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
> at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
> at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
> at
> x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
> at scala.util.Try$.apply(Try.scala:209)
> at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
> at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
> at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
> at scala.util.Success.$anonfun$map$1(Try.scala:251)
> at scala.util.Success.map(Try.scala:209)
> at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
> at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
> at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
>
>
>
>
>
>
>

Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Any advice? Can anyone help?
I can reproduce the problem and capture more logs if needed.

> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x...@gmail.com> wrote:
> 
> Looks like the lock goes away 30 minutes after the index region split.
> So I can assume that this issue comes from the cache configured by this option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
> 
> 
> 
>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>> 
>> And this is how the split looks in the Master logs:
>> 
>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO  [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
>> 
>>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x62ash@gmail.com> wrote:
>>> 
>>> Looks like the problem was caused by an index region split
>>> 
>>> Index region split at prod013:
>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
>>> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at 1537461904877
>>> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at 1537461904880
>>> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at 1537461904987
>>> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at 1537461905002
>>> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at 1537461905002
>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION at 1537461905056
>>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION at 1537461905131
>>> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at 1537461905249
>>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at 1537461905252
>>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at 1537461905439
>>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at 1537461905439
>>> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>>> 
>>> Index update failed at prod002:
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region
>>> IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates. indexState=PENDING_DISABLE
>>> Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>> 
>>> 
>>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x...@gmail.com> wrote:
>>>> 
>>>> Our setup:
>>>> HBase-1.4.7
>>>> Phoenix-4.14-hbase-1.4
>>>> 
>>>> 
>>>>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x...@gmail.com> wrote:
>>>>> 
>>>>>  Hello,
>>>>> Looks like we got a deadlock with a repeating "ERROR 1120 (XCL20)" exception. At this time all indexes are ACTIVE.
>>>>> Can you help us make a deeper diagnosis?
>>>>> 
>>>>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
>>>>> 	at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>>> 	at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>>> 	at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>>> 	at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>>> 	at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>>> 	at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>> 	at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>>> 	at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>>> 	at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>>> 	at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>>> 	at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>>> 	at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>>> 	at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>>> 	at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>>> 	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>>> 	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>>> 	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>>> 	at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>>> 	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>>> 	at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>>> 	at scala.util.Success.map(Try.scala:209)
>>>>> 	at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>>> 	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>>> 	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>>> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Looks like the lock goes away 30 minutes after the index region split.
So I can assume that this issue comes from the cache configured by this option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
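
Until the root cause is fixed, we may work around it on the client side by treating error 1120 as transient and retrying the commit with backoff, since the block clears once the cached metadata expires. A rough sketch (commitWithRetry is a hypothetical helper, not part of our PhoenixDao, and the retry bounds are arbitrary):

    import java.sql.{Connection, SQLException}

    object RetryingCommit {
      // Vendor code for ERROR 1120 (XCL20): writes blocked until the index can be updated.
      private val IndexWriteBlocked = 1120

      def commitWithRetry(conn: Connection, maxAttempts: Int = 5, initialBackoffMs: Long = 2000L): Unit = {
        var attempt = 0
        var backoff = initialBackoffMs
        var done = false
        while (!done) {
          attempt += 1
          try { conn.commit(); done = true }
          catch {
            case e: SQLException if e.getErrorCode == IndexWriteBlocked && attempt < maxAttempts =>
              Thread.sleep(backoff) // wait out the blocked index write
              backoff *= 2          // exponential backoff between attempts
          }
        }
      }
    }

Bounding the attempts keeps a genuinely broken index from being retried forever.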



> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x...@gmail.com> wrote:
> 
> And this is how the split looks in the Master logs:
> 
> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO  [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
> 
>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x...@gmail.com> wrote:
>> 
>> Looks like problem was because of index region split
>> 
>> Index region split at prod013:
>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
>> Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at 1537461904877
>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at 1537461904877
>> Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at 1537461904880
>> Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at 1537461904987
>> Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at 1537461905002
>> Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at 1537461905002
>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION at 1537461905056
>> Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION at 1537461905131
>> Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at 1537461905249
>> Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at 1537461905252
>> Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at 1537461905439
>> Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at 1537461905439
>> Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439
>> 
>> Index update failed at prod002:
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Re
>> gion IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates. indexState=PENDING_DISABLE
>> Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>> 
>> 
>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x...@gmail.com> wrote:
>>> 
>>> Our setup:
>>> HBase-1.4.7
>>> Phoenix-4.14-hbase-1.4
>>> 
>>> 
>>>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x...@gmail.com> wrote:
>>>> 
>>>>  Hello,
>>>> Looks live we got dead lock with repeating "ERROR 1120 (XCL20)" exception. At this time all indexes is ACTIVE.
>>>> Can you help to make deeper diagnose?
>>>> 
>>>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
>>>> 	at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>>> 	at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>>> 	at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>>> 	at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>>> 	at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>>> 	at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>>> 	at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>>> 	at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>> 	at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>>> 	at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>>> 	at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>>> 	at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>>> 	at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>>> 	at scala.collection.immutable.Stream.length(Stream.scala:309)
>>>> 	at scala.collection.SeqLike.size(SeqLike.scala:105)
>>>> 	at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>>> 	at scala.collection.AbstractSeq.size(Seq.scala:41)
>>>> 	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>>> 	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>>> 	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>>> 	at scala.util.Try$.apply(Try.scala:209)
>>>> 	at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>>> 	at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>>> 	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>>> 	at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>>> 	at scala.util.Success.map(Try.scala:209)
>>>> 	at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>>> 	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>>> 	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>>> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>> 
>>>> 
>>> 
>> 
> 


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
And here is how this split looks in the Master logs:

Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO  [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO  [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO  [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)

> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x...@gmail.com> wrote:
> 
> Looks like the problem was caused by an index region split.


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Looks like the problem was caused by an index region split.

Index region split at prod013:
Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO  [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
Sep 20 19:45:05 prod013 hbase[193055]:         STARTED at 1537461904853
Sep 20 19:45:05 prod013 hbase[193055]:         PREPARED at 1537461904877
Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_PRE_SPLIT_HOOK at 1537461904877
Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_PRE_SPLIT_HOOK at 1537461904877
Sep 20 19:45:05 prod013 hbase[193055]:         SET_SPLITTING at 1537461904880
Sep 20 19:45:05 prod013 hbase[193055]:         CREATE_SPLIT_DIR at 1537461904987
Sep 20 19:45:05 prod013 hbase[193055]:         CLOSED_PARENT_REGION at 1537461905002
Sep 20 19:45:05 prod013 hbase[193055]:         OFFLINED_PARENT at 1537461905002
Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_A_CREATION at 1537461905056
Sep 20 19:45:05 prod013 hbase[193055]:         STARTED_REGION_B_CREATION at 1537461905131
Sep 20 19:45:05 prod013 hbase[193055]:         PONR at 1537461905192
Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_A at 1537461905249
Sep 20 19:45:05 prod013 hbase[193055]:         OPENED_REGION_B at 1537461905252
Sep 20 19:45:05 prod013 hbase[193055]:         BEFORE_POST_SPLIT_HOOK at 1537461905439
Sep 20 19:45:05 prod013 hbase[193055]:         AFTER_POST_SPLIT_HOOK at 1537461905439
Sep 20 19:45:05 prod013 hbase[193055]:         COMPLETED at 1537461905439

Index update failed at prod002:
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN  [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Sep 20 20:09:24 prod002 hbase[97285]:  on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates. indexState=PENDING_DISABLE
Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Sep 20 20:09:24 prod002 hbase[97285]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO  [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed.  disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
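
Side note for anyone else debugging a PENDING_DISABLE like the one above: a rough Scala/JDBC sketch of how the index state can be read back from SYSTEM.CATALOG (the ZooKeeper quorum in the URL is hypothetical, and the commented-out REBUILD is a last resort, only if the index never returns to ACTIVE on its own):

import java.sql.DriverManager

object CheckIndexState extends App {
  // Hypothetical quorum; adjust to your cluster.
  val conn = DriverManager.getConnection("jdbc:phoenix:10.0.0.1,10.0.0.2,10.0.0.3:2181")
  try {
    // SYSTEM.CATALOG stores the index state as a one-letter code in INDEX_STATE.
    val rs = conn.createStatement().executeQuery(
      "SELECT TABLE_NAME, INDEX_STATE FROM SYSTEM.CATALOG " +
      "WHERE TABLE_NAME = 'IDX_MARK_O' AND INDEX_STATE IS NOT NULL")
    while (rs.next()) println(s"${rs.getString(1)} -> state code ${rs.getString(2)}")
    // Last resort if the index stays stuck:
    // conn.createStatement().execute("ALTER INDEX IDX_MARK_O ON TBL_MARK REBUILD")
  } finally conn.close()
}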


> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x...@gmail.com> wrote:
> 
> Our setup:
> HBase-1.4.7
> Phoenix-4.14-hbase-1.4


Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

Posted by Batyrshin Alexander <0x...@gmail.com>.
Our setup:
HBase-1.4.7
Phoenix-4.14-hbase-1.4

